Artificial Intelligence (AI) & Machine Learning (ML) – Datactics

Nightmare on LLM Street: How To Prevent Poor Data Haunting AI
Fri, 25 Oct 2024

Why risk taking a chance on poor data for training AI? If it's keeping you awake at night, read on for a strategy to overcome the nightmare scenarios!


It’s October, the Northern Hemisphere nights are drawing in, and for many it’s the time of year when things take a scarier turn. But for public sector leaders exploring AI, that fright need not apply to your data. It definitely shouldn’t be something that haunts your digital transformation dreams.

With a reported £800m budget unveiled by the previous government to address ‘digital and AI’, UK public sector departments are keen to be the first to explore the sizeable benefits that AI and automation offer. The change of government in July 2024 has done nothing to indicate that this drive has lessened in any way; in fact, the Labour manifesto included the commitment to a “single unique identifier” to “better support children and families”[1].

While we await the first Budget of this Labour government, it’s beyond doubt that there is an urgent need to tackle this task amid a cost-of-living crisis, with economies still trying to recover from the economic shock of COVID and to absorb energy price hikes against a backdrop of several sizeable international conflicts.

However, like Hollywood’s best Halloween villains, old systems, disconnected data, and a lack of standardisation are looming large in the background.

Acting First and Thinking Later

It’s completely understandable that the pressures would lead us to this point. Societal expectations raised by the emergence of ChatGPT, among others, have only fanned the flames, swelling the sense that technology should just ‘work’ and leading to an overinflated belief in what is possible.

Recently, LinkedIn attracted some consternation[2] by automatically including members’ data in its AI models without seeking express consent first. For whatever reason, the possibility that people would object to this change was overlooked. It took the intervention of the UK’s Information Commissioner’s Office, the ICO, for the change to be withdrawn – in the UK, at least.

A dose of reality is the order of the day. Government systems lack integrated data, and clear consent frameworks of the type that LinkedIn actually possesses seldom exist in any consistent form. Already short of funds, the public sector needs to act carefully, and mindfully, to prevent its AI experiments (which is, after all, what they are) from leading to inaccuracies and wider distrust among the general public.

One solution is for Government departments to form one, holistic set of consents concerning use of data for AI, especially Large Language Models and Generative AI – similar to communication consents under the General Data Protection Regulation, GDPR.

The adoption of a flexible consent management policy, one which can be updated and maintained for future developments and tied to an interoperable, standardised single view of citizen (SCV), will serve to support the clear, safe development of AI models into the future. Building models now, on shakier foundations, will only serve to erode public faith. The evidence of the COVID-era exam grades fiasco[3] demonstrates the risk that these models present to real human lives.

Of course, it’s not easy to do. Many legacy systems contain names, addresses and other citizen data in a variety of formats. This makes it difficult to be sure that when more than one dataset includes a particular name, that name actually refers to the same individual. Traditional solutions to this problem use anything from direct matching technology to the truly awful exercise of humans manually reviewing tens of thousands of records in spreadsheets. This is one recurring nightmare that society really does need to stop having.
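To make the problem concrete, here is a deliberately tiny Python sketch of fuzzy name matching using only the standard library. The names and the token-sort approach are illustrative assumptions; real matching engines are far more sophisticated than this:

```python
from difflib import SequenceMatcher

def normalise(name: str) -> str:
    """Crude normalisation: lowercase, drop punctuation, sort the name tokens."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(sorted(cleaned.split()))

def similarity(a: str, b: str) -> float:
    """Fuzzy similarity between two normalised names, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

# Two legacy records that an exact-match comparison would treat as different people:
print(similarity("Smith, John A.", "john a smith"))     # → 1.0
print(similarity("Smith, John A.", "Joan O'Sullivan"))  # a much lower score
```

Even this toy version shows why algorithmic matching beats manual spreadsheet review: the comparison is consistent, explainable, and repeatable across millions of records.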

Taking Refuge in Safer Models

Intelligent data matching uses a variety of matching algorithms and well-established machine learning techniques to reconcile data held in old systems, new ones, documents, and even voice notes. Such approaches could help the public sector to streamline its SCV processes and manage consents more effectively. The ability to understand who has opted in, marrying opt-ins and opt-outs to demographic data, is critical. This approach will help model creators to interpret the inherent bias in models built only on those consenting to take part, and to understand how reflective of society the predictive models are likely to be – including whether it is actually safe to use a model at all.

It’s probable that this transparency of process could also foster greater trust among the general public, and greater willingness to take part in data sharing of this kind. In the LinkedIn example, the news that data was being used without explicit consent raced around like wildfire on the platform itself. This outcome cannot be what LinkedIn anticipated, which in and of itself raises concerns about the mindset of the model creators.

It Doesn’t Have to Be a Nightmare

It’s a spooky enough season without adding more fear to the bonfire; certainly, this article isn’t intended as a reprimand. The desire to save time and money to deliver better services to a country’s citizens is a major part of many a civil servant’s professional drive. And AI and automation offer so many opportunities for much better outcomes! For just one example, NHS England’s AI tool already uses image recognition to detect heart disease up to 30 times faster than a human[4]. Mid and South Essex (MSE) NHS Foundation Trust used a predictive machine learning model called Deep Medical to reduce the rate at which patients either didn’t attend appointments or cancelled at short notice (referred to as Did Not Attend, or DNA). Its pilot project identified which patients were more likely to fall into the DNA category, developed personalised reminder schedules, and, by identifying frail patients who were less likely to attend an appointment, highlighted them to the relevant clinical teams.[5]

The time for taking action is now. Public sector organisations, government departments and agencies should focus on the need to develop systems that will preserve and maintain trust in the AI-led future. This blog has shown that better is possible, through a dedicated desire to align citizen data and their consents to contact. In a society where people have trust and transparency in the ways that their data will be used to train AI, the risk of nightmare scenarios can be averted and we’ll all sleep better at night.


[1]

[2]

[3]

[4]

[5]



AI/ML Scalability with Kubernetes
Wed, 05 Jun 2024


Kubernetes: An Introduction

In the ever-evolving world of engineering, scalability isn’t just a feature—it’s a necessity. As businesses and data continue to grow, the ability to scale applications efficiently becomes critical. At Datactics, we are at the forefront of integrating cutting-edge AI/ML functionality that enhances our Augmented Data Quality solutions. To align with current standards and ensure optimal AI/ML scalability with Kubernetes, our AI/ML team has integrated K8s into our infrastructure and deployment strategies.

What is Kubernetes? 

Kubernetes, also known as K8s, is an open-source platform designed to automate the deployment, scaling, and management of containerised applications. It adjusts the number of containerised applications to match incoming traffic, ensuring adequate resources to handle requests seamlessly.

Docker containers, managed through an API layer often using FastAPI, function like fully equipped packages of software, including all necessary dependencies. Kubernetes enables ‘horizontal scaling’—increasing or decreasing the number of container instances based on demand—using various load balancing and rollout strategies to make the process appear seamless. This method helps evenly spread traffic among containers, preventing overload and optimising resources. 

Kubernetes for Data Management

Every day, companies handle a lot of complicated data from different sources, at different velocities, and scales. This includes important tasks like cleaning, combining, matching, and resolving errors. It’s crucial to suggest and enforce Data Quality (DQ) rules in your data pipelines and efficiently identify DQ issues, ensuring these processes are automated, scalable, and responsive to fluctuating demands. 

Many organisations use Kubernetes (K8s) to automate deploying, scaling, and managing applications in containers across multiple machines. With features like service discovery, load balancing, self-healing, automated rollouts, and rollbacks, Kubernetes has become a standard for managing applications that are essential for handling complex data—both in the cloud and on-premise. Implementing AI/ML scalability with Kubernetes allows these organisations to process large volumes of data efficiently and respond quickly to changes in data flow and processing demands.

Real-World Scenario: The Power of Kubernetes 

It’s Friday at 5pm, and just as you’re about to leave the office, your boss informs you that transaction data for last month has been uploaded to the network share in a CSV document and it needs to be profiled immediately. The CSV file is massive—about a terabyte of data—and trying to open it in Excel would be disastrous. This is where Datactics and Kubernetes come to the rescue.

You could run a Python application that might take all weekend to process, meaning you’d have to keep checking its progress and your weekend would be ruined. Instead, you could use Kubernetes to scale out Datactics’ powerful Profiling tools and complete the profiling before you even leave the building. Company saved. Weekend saved.

Application of Kubernetes 

The world has grown progressively faster, and speed in the digital realm is king: speed in service delivery, speed in recovery in the event of a failure, and speed to production. We believe that the AI/ML features offered by Datactics should adhere to the same high standards. No matter how much data your organisation handles or how many data sources there are, it’s important to adjust resources to meet demand and reduce waste during the most critical moments.

At Datactics, AI/ML features are deployed as Docker containers with FastAPI. Depending on your particular environment, we might run these containers on a single machine, such as an AWS EC2 instance, and deploy a single instance of each AI/ML feature, which is suitable for experiments and proofs of concept. However, for a fully operational infrastructure capable of supporting a large organisation, Kubernetes is essential.

Kubernetes helps deploy Docker containers by providing a blueprint with deployment details, necessary resources, and any dependencies like external storage. This blueprint facilitates horizontal scaling to support additional instances of each AI/ML feature. 
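To make the “blueprint” idea concrete, a minimal manifest for one containerised AI/ML feature might look like the following. Every name, image path, and threshold below is a placeholder assumption for illustration, not an actual production configuration:

```yaml
# Illustrative only: names, image path and thresholds are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-feature
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-feature
  template:
    metadata:
      labels:
        app: ml-feature
    spec:
      containers:
        - name: ml-feature
          image: registry.example.com/ml-feature:1.0  # FastAPI app in a Docker container
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
---
# Horizontal scaling: add replicas when average CPU utilisation exceeds 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-feature
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-feature
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Applied with `kubectl apply -f`, this asks Kubernetes to keep at least two instances running and to add replicas automatically whenever average CPU utilisation rises above 70%, scaling back down as demand falls.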

Conclusion 

Kubernetes proved to be a game-changer for scaling Datactics’ AI/ML services, ultimately leading to a robust solution that ensures our AI/ML features can dynamically scale according to client needs. We tailor our deployment strategies to meet the diverse needs of our clients. Whether the requirement is a simple installation or a complex, scalable infrastructure, our commitment is to provide solutions that ensure our clients’ applications are efficient, reliable, and scalable.

We aim to meet any specific requirements, always exploring the various potential deployment setups preferred by our clients. If your organisation is looking to enhance its data processing capabilities, get in touch. Let us help you optimise your data management strategies with the power of Kubernetes and our innovative AI/ML solutions.

The benefits of an Augmented Data Quality Solution
Mon, 22 Jan 2024


In the digital era, data is essential for every organisation, meaning good data management is needed to empower businesses to make well-informed decisions and operate efficiently. However, this can be a challenging landscape, encompassing catalogs, lineage, observability, master data management, and data quality.

We’re at a point now where institutions’ data estates are rapidly expanding. Stretching from legacy systems to cloud migrations and data warehouses, and spanning relational databases to unstructured documents, the importance of data quality has never been greater. This, coupled with the decentralisation of organisational data, has made it difficult for organisations to maintain good data quality.

From traditional to transformative Data Quality Solutions

Addressing data quality issues within a business has typically involved very labour-heavy, manual processes. The nature of the modern data landscape, with its complex and ever-growing data sets, is demanding much more in the way of transformative solutions. Consequently, data quality systems must now adapt to automate processes like data profiling, rule suggestion, and time-series analysis of data issues. This is where the revolutionary concept of ‘augmented data quality’ comes into play.

Augmented Data Quality – What is it?

In short, augmented data quality is an approach that uses machine learning (ML) and artificial intelligence (AI) to automate and enhance data quality management. The aim is to automatically improve data quality by analysing data, identifying and fixing issues, and providing clear, transparent metrics on data quality and improvement actions across your entire data estate. As a result, our users have found that an augmented data quality approach makes their data assets more valuable, allowing them to maximise the value of their data at a low cost with minimal manual effort.

Augmented data quality promotes self-service data quality management, making it easier for business users to carry out tasks without the need for deep technical expertise and knowledge of data science techniques. Moreover, it offers many benefits, from improved data accuracy to increased efficiency and reduced costs. Rather than needing to carry out many specific tasks when assessing the quality of a set of data, augmented data quality automates this process, making it a valuable resource for enterprises dealing with big data.

Whilst AI and machine learning models can speed up routine DQ tasks, they cannot fully automate the whole process. In other words, augmented data quality does not eliminate the need for human oversight, decision-making, and intervention; instead, it complements them through human-in-the-loop technology. Advanced algorithms perform large numbers of checks and fixes, while human expertise is reserved for reviewing and tackling only the most difficult issues, ensuring the highest levels of accuracy.

Datactics Augmented Data Quality Platform

 

[Image: Datactics Augmented Data Quality Solution]

 

Responding to these challenges, Datactics has developed the Augmented Data Quality platform (ADQ), which streamlines the data quality journey through a user-friendly interface. Our technology team has pioneered the use of AI/ML capabilities to make it easier for businesses to improve data quality. This includes:

  • Automated Data Profiling: Enabling you to efficiently onboard new sources of data or analyse existing ones, this feature allows the user to quickly understand their data, identify trends and outliers, and, when errors are found, automatically suggest and apply data quality rules.
  • DQ Insights Hub: Making use of a wide range of our machine learning capabilities, this feature provides a summarised view of data quality across many sources, allowing you to create interactive and fully customisable dashboards. These dashboards highlight and track many DQ metrics, from the number of issues found with each data element to the average time it takes for these issues to be remediated and how often they recur.
  • Predictive Features:  We’ve developed a bespoke machine learning algorithm that learns from your data quality issues, allowing you to gain a deeper understanding of the root causes of the problems and empowering you to take preventative measures to ensure they don’t reoccur. By training this exclusively on your data, you get the most accurate predictions whilst also ensuring your data is fully secure.
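As a rough illustration of the kind of metrics such a dashboard tracks, the two examples above can be computed in a few lines of Python. The issue log and its schema here are invented for the sketch, not ADQ’s actual data model:

```python
from collections import Counter
from datetime import datetime

# Hypothetical issue log: (data element, raised, remediated) - not ADQ's real schema.
issues = [
    ("customer.email", datetime(2024, 1, 1), datetime(2024, 1, 3)),
    ("customer.email", datetime(2024, 1, 5), datetime(2024, 1, 6)),
    ("trade.amount",   datetime(2024, 1, 2), datetime(2024, 1, 2)),
]

# Metric 1: number of issues found with each data element.
issues_per_element = Counter(element for element, _, _ in issues)

# Metric 2: average time, in days, from detection to remediation.
durations = [(fixed - raised).days for _, raised, fixed in issues]
avg_remediation_days = sum(durations) / len(durations)

print(issues_per_element)    # Counter({'customer.email': 2, 'trade.amount': 1})
print(avg_remediation_days)  # 1.0
```

The value of the platform is in computing metrics like these continuously, across every connected source, rather than in one-off scripts.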
Benefits of the Datactics ADQ platform

These capabilities represent tangible benefits for our users. At the heart of ADQ’s success is the new user layer that simplifies all the key components of a good data quality solution, such as connectivity, integrations, rule authoring, remediation, and insights.

The Datactics platform is designed with all levels of users in mind. ADQ’s interface is intuitive and user-friendly, ensuring that users, regardless of their technical proficiency, can easily navigate and utilise the platform to its full potential. With support for a spectrum of different technologies, ADQ is the perfect platform for any user, from non-technical business users to expert data scientists. This approach democratises data quality management, making it accessible and manageable for a wider range of professionals within an organisation.

The practical benefits of ADQ are evident in our client testimonials, with users reporting significant reductions in cost and time associated with building data quality projects. Specifically, the rule suggestion feature has been a game-changer for many, identifying a substantial portion of business rules which results in considerable time savings. Essentially, it provides a pragmatic and practical real-world understanding of data quality.

 

 

[Image: Datactics Augmented Data Quality]

Empowering Organisations with Data

In the future, we plan to enhance ADQ with more automated features, better insights, and additional integrations. Some of the new features upcoming this year include incorporating generative AI into the platform, allowing non-technical users to create data quality checks using natural language prompts. Suggested remediations, generated using historical fixes and our bespoke machine learning algorithm, will vastly boost the number of issues that can be automatically resolved, decreasing the likelihood of human error and leaving your data stewards free to tackle the most critical and problematic cases. Additionally, by enhancing our predictive capabilities, we will allow you to act before data quality issues occur, ensuring your organisation is always working with high-quality data.

The release of ADQ marks a significant milestone at ĢƵ, in terms of innovation and supporting our customers. It embodies our commitment to providing state-of-the-art data management solutions, enabling organisations to fully leverage their data assets. We are proud of our team’s vision and dedication to delivering a platform that not only addresses current data quality challenges but also paves the way for future innovations.

For more information about the Datactics ADQ solution, take a look at this piece, or reach out to us at www.datactics.com.

 

The Importance of Data Quality in Machine Learning
Mon, 18 Dec 2023


We are currently in an exciting era, in which Machine Learning (ML) is applied across sectors from self-driving cars to personalised medicine. Although ML models have been around for a while – algorithmic trading models since the 1980s, Bayesian methods since the 1700s – we are still in the nascent stages of productionising ML.

From a technical viewpoint, this is ‘Machine Learning Ops’, or MLOps. MLOps involves figuring out how to build models, deploy them via continuous integration and continuous deployment, and track and monitor models and data in production.

From a human, risk, and regulatory viewpoint, we are grappling with big questions about ethical AI (Artificial Intelligence) systems and where and how they should be used. Risk, privacy and security of data, accountability, fairness, and adversarial AI all come into play here. Additionally, the distinctions between supervised, semi-supervised, and unsupervised machine learning bring further complexity to the mix.

Much of the focus is on the models themselves, yet everyone can get their hands on pre-trained models or licensed APIs; what differentiates a good deployment is the data quality.

However, the one common theme that underpins all this work is the rigour required in developing production-level systems, and especially the data necessary to ensure they are reliable, accurate, and trustworthy. This is especially important for ML systems, given the role that data and processes play and the impact of poor-quality data on ML algorithms and learning models in the real world.

Data as a common theme

If we shift our gaze from the model side to the data side, the questions include:

  • Data management – what processes do I have to manage data end to end, especially generating accurate training data?
  • Data integrity – how am I ensuring I have high-quality data throughout?
  • Data cleansing and improvement – what am I doing to prevent bad data from reaching data scientists?
  • Dataset labelling – how am I avoiding the risk of unlabelled data?
  • Data preparation – what steps am I taking to ensure my data is data science-ready?

Answering these would give us a far greater understanding of performance and model impact (consequences). However, this is often viewed as less glamorous or exciting work and, as such, is often undervalued. For example, what is the impetus for companies or individuals to invest at this level (regulatory – e.g. BCBS – financial, reputational, legal)?

Yet, as has been well defined:

“Data largely determines performance, fairness, robustness, safety, and scalability of AI systems… [yet] in practice, most organizations fail to create or meet any data quality standards, from under-valuing data work vis-a-vis model development.”

This has a direct impact on people’s lives and society, where “…data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact, impacting predictions like cancer detection, wildlife poaching, and loan allocations”.

What this looks like in practice

We have seen this in the past, with the exam grading fiasco in the UK during Covid. In this case, teachers predicted the grades of their students, then the Office of Qualifications and Examinations Regulation applied an algorithm to these predictions to downgrade any potential grade inflation. This algorithm was quite complex and non-transparent in the first instance. When the results were released, 39% of grades had been downgraded. The algorithm captured the distribution of grades from previous years, the predicted distribution of grades for past students, and then applied these to the current year.

In practice, this meant that if you were a candidate who had performed well at GCSE but attended a historically poor-performing school, it was challenging to achieve a top grade. Teachers had to rank the students in each class, resulting in a relative ranking system that could not equate to absolute performance. It meant that even if you were predicted a B, if you were ranked fifteenth out of 30 in your class and the pupil ranked fifteenth in each of the last three years had received a C, you would likely get a C.

The application of this algorithm caused an uproar, not least because schools with small class sizes – usually private, fee-paying schools – were exempt from the algorithm, resulting in the use of the teacher-predicted grades instead. Additionally, it baked in past socioeconomic biases, benefitting underperforming students in affluent (and previously high-scoring) areas while suppressing the results of high-performing students in lower-income regions.

A major lesson to learn from this, therefore, was the need for transparency in both the process and the data that was used.

An example from healthcare

Within the world of healthcare, poor-quality training data had an impact on ML cancer prediction with IBM’s ‘Watson for Oncology’, which partnered with The University of Texas MD Anderson Cancer Center in 2013 to “uncover valuable insights from the cancer center’s rich patient and research databases”. The system was trained on a small number of hypothetical cancer patients, rather than real patient data. This resulted in erroneous and dangerous cancer treatment advice.

Significant questions that must be asked include:

  • Where did it go wrong here – certainly with the data, but also with the wider AI system?
  • Where was the risk assessment?
  • What testing was performed?
  • Where did responsibility and accountability reside?

Machine Learning practitioners know well the statistic that 80% of ML work is data preparation. Why then don’t we focus on this 80% effort and deploy a more systematic approach to ensure data quality is embedded in our systems, and considered important work to be performed by an ML team?

This is a view recently articulated by Andrew Ng, who urges the ML community to be more data-centric and less model-centric. In fact, Andrew was able to demonstrate this using a steel-sheet defect detection use case, in which a deep learning computer vision model achieved a baseline performance of 76.2% accuracy. By addressing inconsistencies in the training dataset and correcting noisy or conflicting dataset labels, the classification performance reached 93.1%. Interestingly, and compellingly from the perspective of this blog post, minimal performance gains were achieved by addressing the model side alone.
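The effect is easy to reproduce on synthetic data. The sketch below is not Andrew Ng’s experiment; it trains the simplest possible classifier (1-nearest-neighbour on a made-up one-dimensional “defect score”) with clean and with deliberately corrupted labels, to show how label noise alone drags accuracy down:

```python
import random

random.seed(42)

def make_data(n):
    """Synthetic 1-D 'defect score': class 0 clusters near 0.2, class 1 near 0.8."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        data.append((random.gauss(0.2 if label == 0 else 0.8, 0.05), label))
    return data

train, test = make_data(300), make_data(300)

# Simulate annotation mistakes: flip 30% of the training labels.
noisy_train = [(v, 1 - l) if random.random() < 0.3 else (v, l) for v, l in train]

def predict_1nn(train_set, value):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return min(train_set, key=lambda p: abs(p[0] - value))[1]

def accuracy(train_set, test_set):
    return sum(predict_1nn(train_set, v) == l for v, l in test_set) / len(test_set)

print(f"clean labels: {accuracy(train, test):.2f}")        # essentially perfect
print(f"noisy labels: {accuracy(noisy_train, test):.2f}")  # dramatically worse
```

The model is identical in both runs; only the quality of the training labels changes, which is precisely the data-centric point.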

Our view is: if data quality is a key limiting factor in ML performance, then let’s focus our efforts on improving data quality – and ask whether ML itself can be deployed to address it. This is the central theme of the work the ML team at Datactics undertakes. Our focus is automating the manual, repetitive (often referred to as boring!) business processes of DQ and matching tasks, while embedding subject matter expertise into the process. To do this, most of our solutions employ a human-in-the-loop approach where we capture human decisions and expertise and use this to inform and re-train our models. Having this human expertise is essential in guiding the process and providing context, improving both the data and the data quality process. We are keen to free up clients from manual, mundane tasks and instead use their expertise on tricky cases with simpler agree/disagree options.
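A minimal sketch of that human-in-the-loop routing logic might look as follows; the thresholds and candidate scores are invented for illustration only:

```python
def triage(match_candidates, auto_threshold=0.95, review_threshold=0.60):
    """Route candidate record matches by model confidence: auto-accept the
    confident ones, queue borderline cases for a human agree/disagree
    decision, and reject the rest. Thresholds are illustrative only."""
    auto_accepted, needs_review, rejected = [], [], []
    for pair, score in match_candidates:
        if score >= auto_threshold:
            auto_accepted.append(pair)
        elif score >= review_threshold:
            needs_review.append(pair)
        else:
            rejected.append(pair)
    return auto_accepted, needs_review, rejected

candidates = [
    (("J. Smith", "John Smith"), 0.97),  # clear match: no human needed
    (("J. Smith", "Jane Smyth"), 0.72),  # tricky case: route to an expert
    (("J. Smith", "K. Jones"), 0.12),    # clear non-match
]
auto, review, rejected = triage(candidates)
print(len(auto), len(review), len(rejected))  # → 1 1 1
```

The human decisions gathered on the middle bucket can then be fed back as new training labels, so the model improves exactly where it was weakest.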

To learn more about an AI-driven approach to Data Quality, read our press release about our Augmented Data Quality platform here.

How to test your data against Benford's Law
Tue, 09 May 2023


One of the most important aspects of data quality is being able to identify anomalies within your data. There are many ways to approach this, one of which is to test the data against Benford’s Law. This blog will take a look at what Benford’s Law is, how it can be used to detect fraud, and how the Datactics platform can be used to achieve this.

What is Benford’s Law?

Benford’s law is named after a physicist called Frank Benford and was first discovered in the 1880s by an astronomer named Simon Newcomb. Newcomb was looking through logarithm tables (used before pocket calculators were invented to find the value of the logarithms of numbers), when he spotted that the pages which started with earlier digits, like 1, were significantly more worn than other pages.

Given a large set of numerical data, Benford’s Law asserts that the first digit of these numbers is more likely to be small. If the data follows Benford’s Law, then approximately 30% of the time the first digit would be a 1, whilst 9 would only be the first digit around 5% of the time. If the distribution of the first digit was uniform, then they would all occur equally often (around 11% of the time). It also proposes a distribution for the second digit, third digit, combinations of digits, and so on. According to Benford’s Law, the probability that the first digit in a dataset is d is given by P(d) = log10(1 + 1/d).
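The expected frequencies are easy to compute directly from that formula:

```python
import math

def benford_probability(d: int) -> float:
    """P(d) = log10(1 + 1/d): probability that the leading digit is d."""
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.1%}")  # 1: 30.1% ... 9: 4.6%
```

Note that the nine probabilities sum to exactly 1, since the terms telescope to log10(10).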

Why is it useful?

There are plenty of datasets that have been shown to follow Benford’s Law, including stock prices, population numbers, and electricity bills. Because so much naturally occurring data is known to follow Benford’s Law, checking whether a dataset follows it can be a good indicator of whether the data has been manipulated. While this is not definitive proof that the data is erroneous or fraudulent, it can provide a good indication of problematic trends in your data.

In the context of fraud, Benford’s Law can be used to detect anomalies and irregularities in financial data, for example within large datasets such as invoices, sales records, expense reports, and other financial statements. If the data has been fabricated, then the person tampering with it would probably have done so “randomly”. This means the first digits would be uniformly distributed and thus not follow Benford’s Law.

Below are some real-world examples where Benford’s Law has been applied:

Detecting fraud in financial accounts – Benford’s Law can be useful in its application to many different types of fraud, including money laundering and large financial accounts. Many years after Greece joined the eurozone, the economic data it had provided to the E.U. was found to deviate from the distribution that Benford’s Law predicts.

Detecting election fraud – Benford’s Law was used as evidence of fraud in the 2009 Iranian elections and was also used for auditing data from the 2009 German federal elections. Benford’s Law has also been used in multiple US presidential elections.

Analysis of price digits – When the euro was introduced, the different exchange rates meant that, while the “real” price of goods stayed the same, the “nominal” price (the monetary value) of goods was distorted. Research carried out across Europe showed that the first digits of nominal prices followed Benford’s Law. However, deviation from this occurred for the second and third digits, where trends more commonly associated with psychological pricing could be observed. Larger digits (especially 9) are more commonly found because prices such as £1.99 are more readily associated with spending £1 than £2.

How can ĢƵ’ tools be used to test for Benford’s Law?

Using the ĢƵ platform, we can very easily test any dataset against Benford’s Law. Take this dataset of financial transactions (shown below). We’re going to test the “pmt_amt” column to see if it follows Benford’s Law for first digits. It spans several orders of magnitude, ranging from a few dollars to 15 million, which means that Benford’s Law is more likely to apply accurately to it.

Table of data

The first step of the test is to extract the first digit of the column for analysis. This can very easily be done using a small FlowDesigner project (shown below).

ĢƵ Flowdesigner product

 

Here we import the dataset and then filter out any values that are less than 1, as these aren’t relevant to our analysis. Then, we extract the first digit. Once that’s been completed, we can profile these digits to find out how many times each occurs and then save the results.

The next step would be to perform a statistical test to see how confident we can be that Benford’s Law applies here. We can use our Data Quality Manager tool to architect the whole process.

ĢƵ Data Quality Manager product

Step one runs our FlowDesigner project, whilst the second executes a simple Python script to perform the test and the last two steps let us set up an automated email alert to let the user know if the data failed the test at a specified threshold. While I’m using an email alert here, any issues tracking platform, such as Jira, can be used. We can also show the results in a dashboard, like the one below.
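For illustration, a simple test script along these lines might look like the following. This is a hedged sketch, not the actual script used in the pipeline: it compares observed first-digit counts with Benford’s expected counts using a chi-square statistic against a hard-coded 5% critical value, rather than computing a full p-value.

```python
import math
from collections import Counter

# Expected first-digit probabilities under Benford's Law.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
# Chi-square critical value for 8 degrees of freedom at the 5% level.
CRITICAL_5PCT = 15.51

def benford_chi_square(values):
    """Chi-square statistic comparing observed leading digits with Benford's Law.

    Values below 1 are filtered out, mirroring the filtering step described above.
    """
    digits = [int(str(int(v))[0]) for v in values if v >= 1]
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )

def follows_benford(values) -> bool:
    """True when the statistic shows no significant deviation at the 5% level."""
    return benford_chi_square(values) < CRITICAL_5PCT
```

A log-uniformly distributed column passes this test, while uniformly distributed values fail it – which is exactly the pattern the fraud examples above rely on.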

ĢƵ product shows Benford's Law

The graph on the left, with the green line, represents the distribution we would expect the digits to follow if the data obeyed Benford’s Law. The red line shows the actual distribution of the digits. The bottom-right table shows the two distributions, and the top-right table shows the result of the test. In this case, the test finds no significant deviation, so the data is consistent with Benford’s Law.

In conclusion…

The law popularised by physicist Frank Benford is as useful today as ever. Benford’s Law is a powerful tool for detecting fraud and other irregularities in large datasets. By combining statistical analysis with expert knowledge and AI-enabled technologies, organisations can improve their ability to detect and prevent fraudulent activities, safeguarding their financial health and reputation.

Matt Neil is a Machine Learning Engineer at ĢƵ.

The post How to test your data against Benford’s Law appeared first on ĢƵ.

Battling Bias in AI: Models for a Better World /blog/ai-ml/battling-bias-in-ai-models-for-a-better-world/ Mon, 27 Jun 2022 14:37:52 +0000 /?p=18882

The post Battling Bias in AI: Models for a Better World appeared first on ĢƵ.

Battling Bias in AI Models for a Better World

The role of synthetic data

At ĢƵ, we develop and maintain a number of internal tools for use within the AI and Software Development teams. One of these is a synthetic data generation tool that can be used to create large datasets of placeholder information. It was initially built to generate benchmarking datasets as another method of evaluating software performance, but has also been used to generate sample datasets for software demonstrations and for building proof-of-concept solutions. The project has been hugely beneficial, providing tailor-made, customisable datasets for each specific use case, with control over dataset size, column datatype, duplicate entries, and even the insertion of simulated errors to mimic the uncleanliness of some real-world datasets. As this tool has seen increased usage, we’ve discussed and considered additional areas within data science that can benefit from the application of synthetic data.
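To illustrate the idea – this is not the internal tool itself, just a minimal sketch with hypothetical names – a placeholder-data generator with configurable size, duplicates and simulated errors might look like this:

```python
import random
import string

def make_synthetic_rows(n_rows, duplicate_rate=0.1, error_rate=0.05, seed=42):
    """Generate placeholder records with optional duplicates and injected errors."""
    rng = random.Random(seed)  # seeded for reproducible benchmark datasets
    rows = []
    for i in range(n_rows):
        rows.append({
            "id": i,
            "name": "".join(rng.choices(string.ascii_uppercase, k=8)),
            "age": rng.randint(18, 90),
        })
    # Re-insert copies of existing rows to mimic duplicate entries.
    for _ in range(int(n_rows * duplicate_rate)):
        rows.append(dict(rng.choice(rows)))
    # Inject simulated errors, e.g. implausible ages, to mimic dirty data.
    for _ in range(int(len(rows) * error_rate)):
        rng.choice(rows)["age"] = rng.randint(200, 999)
    return rows
```

In practice such a generator would cover many more datatypes and error kinds; the sketch only shows the shape of the approach.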

One such area is the training of machine learning models. Synthetic data and synthetic data generation tools such as the Synthetic Data Vault have already seen widespread usage in the AI/ML space. A report from Gartner has gone as far as to predict that synthetic data will soon overshadow real-world data in AI model development – and understandably so.

Sourcing data for Artificial Intelligence models

Creating implementations of technologies such as deep learning can require massive datasets. Sourcing large, comprehensive, clean, well-structured datasets for training models is a lengthy, expensive process, which is one of the main barriers to entry to the space today. Generating synthetic data for use in place of real-world data opens the door to AI/ML for many teams and researchers that would otherwise have been unable to explore it. This can lead to accelerated innovation in the space, and faster implementation of AI/ML technologies in the real world.

The use of synthetic data can clearly reduce the impact of many of the struggles faced during ML model building. However, there remain some potential flaws in ML models, such as bias, that cannot simply be solved by replacing real-world data with synthetic data.

The risk of bias: a real world-example

There’s no doubt that using raw real-world data in certain use cases can create heavily biased models, as this training data can reflect existing biases in our world. For example, a number of years ago, Amazon began an internal project to build a natural language processing model to parse through the CVs of job applicants to suggest which candidates to hire. Thousands of real CVs submitted by prior applicants were used as training data, labelled by whether or not they were hired.

The model trained using this data began to reflect inherent biases within our world, within the tech industry, and within Amazon as a company, resulting in the model favouring male candidates over others, and failing to close the gender gap in recruitment at the company. A candidate would be less likely to be recommended for hiring by the model if their CV contained the word “women’s”, or mention of having studied at either of two specific all-women’s colleges. This model’s training data was not fit for purpose, as the model it produced reflected the failures of our society, and would have only perpetuated these failures had it been integrated into the company’s hiring process.

It’s important to note where the issue lies here: Natural Language Processing as a technology was not at fault in this scenario – it simply generated a model that reflected the patterns in the data it was provided. A mirror isn’t broken just because we don’t like what we see in it.

For a case such as this, generating synthetic training data initially seems like an obvious improvement over using real data to eliminate the concern over bias in the model entirely. However, synthetic training data must still be defined, generated, analysed, and ultimately signed off on for use by someone, or some group of people. The people that make these decisions are humans, born and raised in a biased world, as we all are. We unfortunately all have unconscious biases, formed from a lifetime of conditioning by the world we live in. If we’re not careful, synthetic data can reflect the biases of the engineer(s) and decision maker(s) specifically, rather than the world at large. This raises the question – which is more problematic?

Will bias always be present?

As a simple point to analyse this from, let’s look at a common learning example used for teaching the basics of building an ML model – creating a salary estimator. In a standard exercise, we can use features like qualification level, years of experience, location, etc. This doesn’t include names, gender, religion, or any other protected characteristic, and with the features we use, you can’t directly determine any of this information. Can this data still reflect biases in our world?

A synthetic training dataset can still reflect the imperfect world we live in, because the presumptions and beliefs of those that ultimately sign off on the data can be embedded into it. Take, for instance, the beliefs of the executive team at a company like Airbnb. They’ve recently abolished location-based pay grades within the company, as they believe that an employee’s work shouldn’t be valued any differently based on their location –  if they’re willing to pay an employee based in San Francisco or New York a certain wage for their given contribution to the team, an employee with a similar or greater level of output based in Iowa, Ireland or India shouldn’t be paid less, simply because average income or cost of living where they live happens to be lower.

If synthetic training data for a salary estimation model were to be analysed and approved by someone that had never considered or disagreed with this point of view, the resulting model could be biased against those that don’t live in areas with high average income and cost of living, as their predicted salary would likely be lower than someone with identical details that lived in a different area.

Similarly, returning to the example of Amazon’s biased CV-scanning model, if we were to generate a diverse and robust synthetic dataset to eliminate gender bias in a model, there’s still a danger of ML algorithms favouring candidates based on the “prestige” of universities, for example. As seen with the news of wealthy families paying Ivy League universities to admit their children, this could be biased in favour of people from affluent backgrounds – people that are more likely to benefit from generational wealth – which can continue to enforce many of the socioeconomic biases that exist within our world.

Additionally, industries such as tech have a noteworthy proportion of the workforce that, despite having a high level of experience and expertise in their respective field, may not have an official qualification from a university or college, having learned from real-world industry experience. A model that fails to take this into account is one with an inherent bias against such workers.

How do we eliminate bias?

As these examples show, eliminating bias isn’t as simple as removing protected characteristics, or ensuring an equal balance of instances of possible values for these features. Trends and systems in our world may reflect its imperfections and biases without showing them explicitly, and beliefs about how these systems should fundamentally operate can vary wildly from person to person. It presents us with an interesting issue moving forwards – if, instead of using real-world data for models to mirror the world we live in, we use synthetic data representative of a world in which we wish to live, how do we ensure that the hypothetical future world this data represents is one that works for all of us?

Centuries ago, the rules and boundaries of society were decided on and codified by the literate – those fortunate enough to have the resources and access to education that allowed them to learn to read and write. The rules that governed the masses and defined our way of life were written into law, and, intentionally or otherwise, these rules tended to benefit those that had the resources and power to be in the position to write them. As technological advancement saw literacy rates increase, “legalese” – technical jargon used to obfuscate the meaning of legal documents – was used to construct a linguistic barrier once again, now against those without the resources to attain a qualification in law.

We’re now firmly in the technological age. As computers and software become ever more deeply ingrained into the fabric of society, it’s important that we as developers are aware of the fact that, if we’re not careful with where and how we develop and integrate our technological solutions, we could be complicit in allowing existing systems of inequality and exploitation to be solidified into the building blocks of our society for the future. Technologies like AI and ML have the ability to allow us to tackle systemic issues in our world to benefit us all, not just those fortunate enough to sit behind the keyboard or their CEOs.

However, to achieve this, we must move forward with care, with caution, and with consideration for those outside the tech space. We’re not the only ones influenced by what we create. At a time where the boot of oppression can be destroyed, it’s important that it doesn’t just end up on a different foot.

The importance of well-designed AI

This is absolutely not to say that AI and ML should be abandoned because of the theoretical dangers that could be faced as a result of careless usage – it means these tools should be utilised and explored in the right places, and in the right way. The potential benefits that well-implemented AI/ML can provide, and the fundamental improvements to our way of life and our collective human prosperity that this technology can bring could change the world for the better, forever.

Technologies such as active learning and deep learning have the capabilities to help automate, streamline and simplify tasks that would otherwise rely on vast amounts of manual human effort.

The reduction in manual human effort and attention required for tasks that can safely and reliably be operated by AI/ML, and the insights that can be gained from its implementation can lead to further advancements in science, exploration and innovation in art, and greater work-life balance, giving us back time for leisure and opportunities for shared community experiences, creating a more connected, understanding society.

That being said, there’s just as much opportunity for misuse of these tools to create a more imbalanced, divided, exploited world, and it’s our job as developers and decision-makers to steer clear of this, pushing this technology and its implementations in the right direction.

In conclusion

I believe that if synthetic data is going to comprise a large majority of the data in use in the near future, it is vitally important that we stay aware of the potential pitfalls of using such data, and make sure to utilise it only where it makes the most sense. The difficulty here for each individual ML project is in determining whether synthetic data or real-world data is the ideal choice for that specific use case of building an ML model. The Turing Institute’s FAST Track Principles for AI Ethics (Fairness, Accountability, Sustainability/Safety and Transparency) provide a strong framework for ethical decision-making and implementation of AI and ML technology – the spirit of these principles must be applied to all forms of development in the AI/ML space, including the use of synthetic data.

There’s no room for complacency. With great power, comes great responsibility.

To learn more about an AI-Driven Approach to Data Quality, download our AI whitepaper by Dr. Browne.

Outlier Detection – What Is It And How Can It Help In The Improvements Of Data Quality? /blog/ai-ml/outlier-detection-what-is-it-and-how-can-it-help-in-the-improvements-of-data-quality/ Fri, 27 May 2022 11:05:50 +0000 /?p=18748

The post Outlier Detection – What Is It And How Can It Help In The Improvements Of Data Quality? appeared first on ĢƵ.

Outlier Detection

Identifying outliers and errors in data is an important but time-consuming task. Depending on the context and domain, errors can be impactful in a variety of ways, some very severe. One of the issues with detecting outliers and errors is that they come in many different forms. There are syntactic errors, where a value like a date or time is in the wrong format, and semantic errors, where a value is in the correct format but doesn’t make sense in the context of the data, like an age of 500. The biggest problem with creating a method for detecting outliers in a dataset is how to identify a vast range of different errors with one tool.

At ĢƵ, we’ve been working on a tool to solve some of these problems and enable errors and outliers to be quickly identified with minimal user input. With this project, our goal is to assign a number to each value in a dataset which represents the likelihood that the value is an outlier. To do this we use a number of different features of the data, which range from quite simple methods like looking at the frequency of a value or its length compared to others in its column, to more complex methods using n-grams and co-occurrence statistics. Once we have used these features to get a numerical representation of each value, we can then use some simple statistical tests to find the outliers. 

When profiling a dataset, there are a few simple things you can do to find errors and outliers in the data. A good place to start could be to look at the least frequent values in a column or the shortest and longest values. These will highlight some of the most obvious errors but what then? If you are profiling numeric or time data, you could rank the data and look at both ends of the spectrum to see if there are any other obvious outliers. But what about text data or unique values that can’t be profiled using frequency analysis? If you want to identify semantic errors, this profiling would need to be done by a domain expert. Another factor to consider is the fact that this must all be done manually. It is evident that there are a number of aspects of the outlier detection process that limit both its convenience and practicality. These are some of the things we have tried to address with this project. 
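As a minimal sketch of that manual first pass – rarest values and length extremes – with purely illustrative names:

```python
from collections import Counter

def profile_column(values, k=3):
    """Basic column profiling: the k rarest values and the length extremes.

    These are the quickest manual checks and surface only the most obvious errors.
    """
    counts = Counter(values)
    # most_common() sorts descending, so slice from the end for the rarest.
    rarest = [v for v, _ in counts.most_common()[:-k - 1:-1]]
    by_length = sorted(set(values), key=len)
    return {
        "rarest": rarest,
        "shortest": by_length[:k],
        "longest": by_length[-k:],
    }
```

Anything beyond this – semantic checks, unique values, text data – is exactly where the manual approach runs out of road, as described above.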

Outlier Detection

When designing this tool, our objective was to create a simple, effective, universal approach to outlier detection. There are a large number of statistical methods for outlier detection that, in some cases, have existed for hundreds of years. These are all based on identifying numerical outliers, which would be useful in some of the cases listed above but has obvious limitations. Our solution to this is to create a numerical representation of every value in the data set that can be used with a straightforward statistical method. We do this using features of the data. The features currently implemented and available for use are: 

  • Character N-Grams 
  • Co-Occurrence Statistics 
  • Date Value 
  • Length 
  • Numeric Value 
  • Symbolic N-Grams 
  • Text Similarities 
  • Time Value 

We are also working on creating a feature of the data to enable us to identify outliers in time series data. Some of these features, such as date and numeric value, are only applicable to certain types of data. Some incorporate the very simple steps discussed above, like occurrence and length analysis. Others are more complicated and could not be done manually, like co-occurrence statistics. Then there are some, like the natural language processing text similarities, which make use of machine learning algorithms. While there will be some overlap in the outliers identified by these features, for the most part they will all single out different errors and outliers, acting as an antidote to the heterogeneous nature of errors discussed above.

One of the benefits of this method of outlier detection is its simplicity which leads to very explainable results. Once features of our dataset have been generated, we have a number of options in terms of next steps. In theory, all of these features could be fed into a machine learning model which could then be used to label data as outlier and non-outlier. However, there are a number of disadvantages to this approach. Firstly, this would require a labelled dataset to train the model with, which would be time-consuming to create. Moreover, the features will differ from dataset to dataset so it would not be a case of “one model fits all”. Finally, if you are using a “black box” machine learning method when a value is labelled as an outlier, you have no way of explaining this decision or evidence as to why this value has been labelled as opposed to others in the dataset. 

All three of these problems are avoidable using the ĢƵ approach. The outliers are generated using only the features of the original dataset and, because of the statistical methods being used, can be identified with nothing but the data itself and a confidence level (a numerical value representing the likelihood that a value is an outlier). There is no need for any labelling or parameter-tuning with this approach. The other big advantage is that, because we assign a number to every value, we have evidence to back up every outlier identified and can demonstrate how they differ from the non-outliers in the data.
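To make the idea concrete – as an illustrative sketch only, not ĢƵ’s implementation – two simple features (length and frequency) can be turned into a per-value score and thresholded statistically:

```python
import statistics
from collections import Counter

def outlier_scores(values):
    """Score each value by how far its simple features (length, frequency)
    deviate from the column norm, measured in standard deviations."""
    counts = Counter(values)
    features = {
        "length": [len(v) for v in values],
        "frequency": [counts[v] for v in values],
    }
    z_columns = []
    for column in features.values():
        mean = statistics.mean(column)
        spread = statistics.pstdev(column) or 1.0  # avoid division by zero
        z_columns.append([abs(x - mean) / spread for x in column])
    # A value's score is its largest deviation across all features.
    return [max(zs) for zs in zip(*z_columns)]

def flag_outliers(values, threshold=2.0):
    """Return the values whose score exceeds the chosen confidence threshold."""
    return [v for v, s in zip(values, outlier_scores(values)) if s > threshold]
```

Because every value carries a score, each flagged outlier comes with numerical evidence of how it differs from the rest of the column, and adding a new feature is just another entry in the dictionary.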

Another benefit of this approach is that it is modular and therefore completely expandable. The features the outliers are based on can be selected according to the data being profiled, which increases accuracy. Using this architecture also gives us the ability to seamlessly expand the number of features available, and if trends or common errors are encountered that aren’t identified using the current features, it is very straightforward to create another feature to rectify this.


AI Ethics: The Next Generation of Data Scientists /blog/ai-ethics-the-next-generation-of-data-scientists/ Mon, 04 Apr 2022 12:54:50 +0000 /?p=18414

The post AI Ethics: The Next Generation of Data Scientists appeared first on ĢƵ.

In March 2022, ĢƵ took advantage of the offer to visit a local secondary school and the next generation of Data Scientists to discuss AI Ethics and Machine Learning in production. Matt Flenley shares more from the first of these two visits in his latest blog below…

Pictures of two ĢƵ employees and two students from Wallace High School Lisburn after an AI Ethics talk
Students from Wallace High School meet Dr Fiona Browne (centre) and Matt Flenley (right)

AI Ethics is often the poster child of modern discourse on when the inevitable machine-led apocalypse will occur. Yet, as we look around at wars in Ukraine and Yemen, record water shortages in the developing world, and the ongoing struggle for the education of girls in Afghanistan, it becomes readily apparent that, as in all things, ethics starts with humans.

This was the main thrust of the discussion with the students at Wallace High School in Lisburn, NI. As Dr Fiona Browne, Head of AI and Software Development, talked the class of second-year A-Level students through data classification for training machine learning models, the question of ‘bad actors’ came up. What if, theorised Dr Browne, people can’t be trusted to label a dataset correctly, and the machine learning model learns things that aren’t true?

At this stage, a tentative hand slowly raised in the classroom; one student confessed that, in fact, they had done exactly this in a recent dataset-labelling exercise in class. It was the perfect opportunity to demonstrate, in a practical way, the human involvement in Artificial Intelligence and Machine Learning, and especially in the quality of the data underpinning both.

Humans behind the machines, and baked-in bias

As is common, the exciting part of technology is often the technology itself. What can it do? How fast can it go? Where can it take me? This applies just as much to the everyday, from home electronics through to transportation, as it does to the cutting edge of space exploration or genome mapping. However, the thought processes behind the technology, imagined up by humans, specified and scoped by humans, create the very circumstances for how those technologies will behave and interact with the world around us.

In her promotion for the book Invisible Women, the author Caroline Criado-Perez writes,

“Imagine a world where your phone is too big for your hand, where your doctor prescribes a drug that is wrong for your body, where in a car accident you are 47% more likely to be seriously injured, where every week the countless hours of work you do are not recognised or valued.  If any of this sounds familiar, chances are that you’re a woman.”

Caroline Criado-Perez, Invisible Women

One example is of the comparatively high rate of anterior cruciate ligament injuries among female soccer players. While some of this can be attributed to different anatomies, it is in part caused by the lack of female-specific footwear in the sport (with most brands choosing to offer smaller sizes rather than tailored designs). Yet the anatomical design of the female knee in particular is substantially different to that of males. Has this human-led decision, to simply offer small sizes, taken into account the needs of the buyer, or the market? Has it been made from the point of view of creating a fairer society?

AI Ethics: The Next Generation of Data Scientists
The ĢƵ team (L to R: Matt Flenley, Shauna Leonard, Edele Copeland) meet GCSE students from the Wallace High School as part of a talk on Women in Technology Careers

If an algorithm were therefore applied to specify a female-specific football boot from the patterns and measurements of existing footwear on the market today, would it result in a different outcome? No, of course not. It takes humans to look at the world around us, detect the risk of bias, and then act to correct it.

It is the same in computing. The product, in this case the machine learning model or AI algorithm, is going to be no better than the work that has gone into defining and explaining it. A core part of this is understanding what data to use, and of what quality the data should be.

Data Quality for Machine Learning – just a matter of good data?

Data quality in a business application sense is relatively simple to define. Typically a business unit has requirements, usually around how complete the data is and to what extent the data in it is unique (there are a wide range of additional data quality dimensions, which you can read about here). For AI and Machine Learning, however, data quality is a completely different animal. On top of the usual dimensions, the data scientist or ML engineer needs to consider if they have all the data they need to create unbiased, explainable outcomes. Put simply, if a decision has been made, then the data scientists need to be able to explain why and how this outcome was reached. This is particularly important as ML becomes part and parcel of everyday life. Turned down for credit? Chances are an algorithm has assessed a range of data sources and generated a ‘no’ decision – and if you’re the firm whose system has made that decision, you’re going to need to explain why (it’s the law!).

AI Ethics: The Next Generation of Data Scientists

This is the point at which we return to the class in Wallace High School. The student who tentatively raised their arm would have got away with it – leaving the model to predict patterns incorrectly – had they stayed silent. There was no monitoring in place to detect which user had been the ‘bad actor’, and so the flaw would have gone undetected without the student’s confession. It was, however, utterly perfect for explaining to this next generation of data scientists the need to free algorithms from bias. In the five years between now and when these students are working in industry, they will need to be fully aware of the need for every possible aspect of the society people wish to inhabit to be represented in the room when data is being classified, and models are being created.

For an industry still so lacking in diversity, it is clear that the decision to do something about what comes next lies where it always has: in the hearts, minds and hands of technology’s builders.

How Data Quality Tools Deliver Clean Data for AI and ML /blog/ai-ml/how-data-quality-tools-deliver-clean-data-for-ai-and-ml/ Mon, 21 Feb 2022 13:26:50 +0000 /?p=18081

The post How Data Quality Tools Deliver Clean Data for AI and ML appeared first on ĢƵ.

In her previous blog, Dr Fiona Browne, Head of AI and Software Development, assessed the need for the AI and Machine Learning world to prioritise the data being fed into models and algorithms (you can read it here). This blog goes into some of the critical capabilities for data quality tools to support specific AI and ML use cases with clean data.

How Data Quality Tools Deliver Clean Data for AI and Machine Learning

A Broad Range of Data Quality Tool Features On Offer

The data quality tools market is full of vendors with a wide range of capabilities, as referenced in the recent Gartner Magic Quadrant. Regardless of the firm’s data volumes, or whether they are a small, midsize or large enterprise, they will be reliant on high quality data for every conceivable business use case, from the smallest product data problem to enterprise master data management. Consequently, data leaders should explore the competitive landscape fully to find the best fit to their data governance culture and the growth opportunities that the right vendor-client fit can offer.

Labelling Datasets

A supervised Machine Learning (ML) model learns from a training dataset consisting of features and labels.

We do not often hear about the efforts required to produce a consistent, well-labelled dataset, yet this has a direct impact on the quality of a model and its predictive performance, regardless of organisation size. A recent Google research report estimates that within an ML project, data labelling can cost between 25% and 60% of the total budget.

Labelling is often a manual process requiring a reviewer to assign a tag to a piece of data e.g. to identify a car in an image, state if a case is fraudulent, or assign sentiment to a piece of text.

Succinct, well-defined labelling instructions should be provided to reduce labelling inconsistencies. Data quality solutions can be applied in this context through metrics that measure label consistency within a dataset, which can then be used to review and improve consistency scores.
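One widely used consistency metric – offered here as an illustrative sketch rather than a specific ĢƵ feature – is Cohen’s kappa, which measures agreement between two annotators corrected for the agreement expected by chance:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: inter-annotator agreement, corrected for chance."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa of 1 indicates perfect agreement, 0 indicates agreement no better than chance – a useful flag that labelling instructions need tightening.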

As labelling is a laborious process, and access to the resources needed to provide labels can be limited, we reduce the volume of manual labelling using an active learning approach.

Here, ML is used to identify the trickiest edge cases within a dataset to label. These prioritised cases are passed to a reviewer to manually annotate, without the need to label the complete dataset. This approach also captures the rationale from a human expert as to why a label was provided, which provides transparency in predictions further downstream.
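A rough sketch of the idea (record names and the budget are illustrative): uncertainty sampling selects the records whose predicted probability sits closest to the decision boundary, and only those go to the human annotator.

```python
def select_for_labelling(predictions, budget):
    """Uncertainty sampling: pick the items whose predicted probability of the
    positive class is closest to 0.5, i.e. where the model is least sure."""
    ranked = sorted(predictions, key=lambda item: abs(predictions[item] - 0.5))
    return ranked[:budget]

probs = {"rec1": 0.97, "rec2": 0.52, "rec3": 0.08, "rec4": 0.45}
queue = select_for_labelling(probs, budget=2)
# queue == ["rec2", "rec4"]: the borderline cases go to the human reviewer
```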

Entity resolution

For data matching and entity resolution, ĢƵ has used ML as a ‘decision aid’ for low-confidence matches to further reduce the burden of manual review. The approach implemented by ĢƵ provides information on the confidence of each prediction through to the rationale as to why it was made. Additionally, the solution has built-in capability to accept or reject predictions, so the client can continually update and improve them, using a fully explainable, human-in-the-loop approach. You can see more information on this in our White Paper.
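A minimal sketch of the triage idea, with hypothetical thresholds (not ĢƵ’s actual product logic): predictions above an auto-accept cut-off or below an auto-reject cut-off are handled automatically, and only the grey zone in between is routed to a human reviewer.

```python
def route_match(score, auto_accept=0.95, auto_reject=0.40):
    """Triage an entity-match prediction by confidence: only the grey zone
    between the two cut-offs is queued for human review."""
    if score >= auto_accept:
        return "accept"
    if score <= auto_reject:
        return "reject"
    return "review"

decisions = [route_match(s) for s in (0.99, 0.70, 0.10)]
# decisions == ["accept", "review", "reject"]
```

Reviewer accept/reject decisions on the grey-zone cases are exactly the feedback that lets the match model improve over time.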

Detecting outliers and predicting rules

This is a critical step in a fully AI-augmented data quality journey, occurring in the key data profiling stage, before data cleansing. It empowers business users, who are perhaps not familiar with big data techniques, coding or programming, to rapidly get to grips with the data they are exploring. Using ML in this way helps them to uncover relationships, dependencies and patterns which can influence which data quality rules they wish to use to improve data quality or deliver better business outcomes, for example regulatory reporting or digital transformation.

This automated approach to identifying potentially erroneous data within your dataset and highlighting these within the context of data profiling reduces manual effort spent in trying to find these connections across different data sources or within an individual data set. It can remove a lot of the heavy lifting associated with data profiling especially when complex data integration or connectivity to data lakes or data stores is required.
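One simple, widely used outlier check that could sit behind such profiling is Tukey’s IQR fence; a sketch on a toy column of transaction amounts:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside the Tukey fence [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

amounts = [102, 98, 105, 99, 101, 100, 97, 5000]
flagged = iqr_outliers(amounts)  # [5000] -- the obvious candidate for review
```

Production tools use richer, multi-column techniques, but the principle is the same: surface the suspicious values automatically instead of hunting for them by hand.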

The rule prediction element complements the outlier detection. It involves reviewing a data set, and suggesting data quality rules that can be run against this set to ensure compliance to both regulations and to standard dimensions of data quality, e.g. consistency, accuracy, timeliness etc., and for business dimensions or policies such as credit ratings or risk appetite.

Fixing data quality breaks

Again, ML helps in this area, where the focus is placed on the manual tasks involved in remediating erroneous or broken data. Can we detect trends in this data? For example, does ingesting a finance dataset on the first day of each month cause a spike in data quality issues? Is there an optimal path to remediation that we can predict, or are there remediation values that we can suggest?
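A trend check of this kind can start very simply, for example comparing each day’s issue count against a trailing baseline. A hypothetical sketch (the window and spike factor are illustrative):

```python
def spike_days(daily_counts, window=7, factor=3.0):
    """Flag days where the issue count jumps well above the trailing average.
    `daily_counts` is an ordered list of (day, issue_count) pairs."""
    spikes = []
    for i, (day, count) in enumerate(daily_counts):
        history = [c for _, c in daily_counts[max(0, i - window):i]]
        if history and count > factor * (sum(history) / len(history)):
            spikes.append(day)
    return spikes

counts = [("d1", 10), ("d2", 12), ("d3", 9), ("d4", 95), ("d5", 11)]
# spike_days(counts) == ["d4"]: e.g. the month-start finance feed
```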

For fixing breaks, we have seen rewards given to the best-performing teams, which reinforces the value of the work. This gamification approach can support business goals through optimal resolution of the key issues that matter to the business, rather than simply trying to fix everything that is wrong, all at once.

Data Quality for Explainability & Bias

We hear a lot about the deployment of ML models and the societal issues in terms of bias and fairness of a model. Applications of models can have a direct, potentially negative impact on people, and it stands to reason that everyone involved in the creation, development, deployment and evaluation of these models should take an active role in preventing such negative impacts from arising.

Having diverse, representative teams building these systems is important. For example, a diverse team could have ensured that Google’s speech recognition software was trained on a diverse selection of voices. In 2016, Rachael Tatman, a research fellow in linguistics at the University of Washington, found that Google’s speech-recognition software performed noticeably less accurately for women’s voices than for men’s.

Focusing on the data quality of the data that feeds our models can help identify areas of potential bias and unfairness. Interestingly, bias isn’t necessarily a bad thing. Models need bias in the data in order to discriminate between outcomes, e.g. having a history of a disease results in a higher risk of having that disease again.

The bias we want to be able to detect is unintended bias and, accordingly, unintended outcomes (and of course, intentional bias created by bad actors). For example, techniques can identify potential proxy features, such as post or ZIP code, that remain predictive even when explicitly discriminatory variables such as race are removed. Researchers have suggested metrics to run against datasets to highlight potential bias, for example taking class labels such as race or gender and running metrics against the decisions made by the classifier. From this identification, different approaches can be taken to address the issues, from balancing a dataset, through penalising bias within an algorithm, to post-processing that favours a particular outcome.
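One of the simplest such metrics is the demographic parity gap: the difference in favourable-outcome rates between groups. An illustrative sketch on toy decisions (group names and data invented):

```python
def demographic_parity_gap(decisions):
    """Difference in favourable-outcome rates between groups.
    `decisions` is a list of (group, outcome) pairs, outcome 1 = favourable."""
    totals, positives = {}, {}
    for group, outcome in decisions:
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + outcome
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates

data = [("A", 1), ("A", 1), ("A", 0), ("A", 1),
        ("B", 1), ("B", 0), ("B", 0), ("B", 0)]
gap, rates = demographic_parity_gap(data)
# gap == 0.5 (A: 0.75 vs B: 0.25) -- a signal to investigate, not proof of unfairness
```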

Explainable AI (XAI)’s Role In Detecting Bias

XAI is a nascent field in which ML is used to explain the predictions made by a classifier. For instance, LIME (Local Interpretable Model-agnostic Explanations) provides a measure of ‘feature importance’. So if we find that postcode, which correlates with race, is a key driver in a prediction, this could highlight discriminatory behaviour within the model.

These approaches explain the local behaviour of a model by fitting an interpretable model, such as a decision tree or linear regression. Again, the type of explanation will differ depending on the audience. For example, different processes may be needed to provide an explanation at an internal or data scientist level compared to an external client or customer level. Explanations could be extended by providing reason and action codes as to why credit was refused.
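LIME itself fits a local surrogate model around the instance being explained; as a much simpler illustration of the same idea (measuring how strongly each feature drives a single prediction), here is a perturbation-based sketch against a hypothetical credit model, with all names and weights invented:

```python
import random

def toy_model(features):
    """Hypothetical credit score: postcode band dominates the output."""
    return 0.8 * features["postcode_band"] + 0.2 * features["income_band"]

def perturbation_importance(model, instance, trials=200, seed=1):
    """Estimate each feature's local influence by randomly perturbing it
    and averaging the change in the model's output."""
    rng = random.Random(seed)
    base = model(instance)
    importance = {}
    for name in instance:
        total = 0.0
        for _ in range(trials):
            perturbed = dict(instance)
            perturbed[name] = rng.random()
            total += abs(model(perturbed) - base)
        importance[name] = total / trials
    return importance

scores = perturbation_importance(toy_model, {"postcode_band": 0.9, "income_band": 0.4})
# postcode_band dwarfs income_band -- a red flag if postcode proxies for race
```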

Transparency can be provided in the form of a structured framework for reporting on ML model provenance, usage, and ethics-informed evaluation, giving a detailed overview of a model’s suggested uses and limitations. This can be extended to the data side, to contain metadata such as data provenance and consent sought.

That being said, there is no single ‘silver-bullet’ approach to address these issues. Instead we need to use a combination of approaches and to test often.  

Where to next – Machine Learning Ops (MLOps)

These days, the ‘-ops’ suffix is often appended to business practices right across the enterprise, from DevOps to PeopleOps, reflecting a systematic approach to how a function behaves and is designed to perform.

In Machine Learning, that same systematic approach, providing transparency and auditability, helps to move the business from brittle data pipelines to a proactive data approach that embeds human expertise.

Such an approach identifies issues within a process rather than relying on an engineer spotting an issue by chance or through individual expertise, which does not scale and is not robust. This system-wide approach embeds governance, security, risk and ownership at all levels. It does require the integration of expertise; for example, model developers gain an understanding of risk from knowledge transferred by risk officers and subject matter experts.

We need a maturing of the MLOps approach to support these processes. This is essential for high-quality and consistent flow of data throughout all stages of a project and to ensure that the process is repeatable and systematic.

It also necessitates monitoring the performance of the model once in production, to take into account potential data drift or concept drift, and to address these as and when identified. Testing for bias, robustness and adversarial attacks is still at a nascent stage, but this only highlights the importance of adopting an MLOps approach right now, rather than waiting until these capabilities are fully developed.
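Data drift monitoring can start with a simple distribution comparison; the Population Stability Index (PSI) is a common choice in financial services. A sketch with illustrative bins and the usual rule-of-thumb thresholds:

```python
import math

def population_stability_index(expected, actual):
    """PSI between two binned distributions (fractions summing to 1).
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift."""
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, 1e-6), max(a, 1e-6)  # guard against empty bins
        psi += (a - e) * math.log(a / e)
    return psi

training_dist = [0.25, 0.25, 0.25, 0.25]  # score distribution at training time
live_dist = [0.10, 0.20, 0.30, 0.40]      # distribution seen in production
drift = population_stability_index(training_dist, live_dist)  # ~0.23, moderate shift
```

A PSI in the moderate range would trigger investigation and, if confirmed, retraining on fresher data.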

In practical terms, industry groups have significant potential to help the public and private sectors better understand the key issues, clarify the priorities and determine what actions are needed to support the safe adoption of AI in financial services.

The post How Data Quality Tools Deliver Clean Data for AI and ML appeared first on ĢƵ.

]]>
ĢƵ demonstrates rapid matching capabilities on open datasets /blog/ai-ml/datactics-demonstrates-rapid-matching-capabilities-on-open-datasets/ Fri, 17 Dec 2021 11:31:33 +0000 /?p=17469 This blog from Fiona Browne, Head of Software Development & AI at ĢƵ, covers the subject of matching data across open datasets, a project for which the firm secured Innovate UK funding.   The Rapid Match project is a vehicle to address the complexity of integrating data and matching data at scale providing a platform for reproducible data pipelines for post and […]

The post ĢƵ demonstrates rapid matching capabilities on open datasets appeared first on ĢƵ.

]]>
Rapid Matching, Software platform, open datasets

This blog from Fiona Browne, Head of Software Development & AI at ĢƵ, covers the subject of matching data across open datasets, a project for which the firm secured Innovate UK funding.  

The Rapid Match project is a vehicle to address the complexity of integrating and matching data at scale, providing a platform for reproducible data pipelines for current and post-COVID analysis. 

The project provides a generalised framework for data quality, preparation, and matching which is easy to use and reproducible for the integration and merging of diverse datasets at scale. 

We highlighted this capability through a Use Case on the identification of financial risk across regions in the UK. Using the ĢƵ platform, data quality, preparation and matching tasks were undertaken to integrate diverse UK Office of National Statistics (ONS) and UK Companies House (CH) datasets to provide a view on regional funding and sectors and the impact of COVID.  


COVID-19-related datasets are being generated at speed and volume, from governmental sources such as the ONS and local authorities, through open data, to third-party datasets. Value is obtained from integrating these data together to provide a view on a particular problem area, for example fraud detection. It is estimated that British banks have lent about £68 billion through a trio of loan programmes, with repayments backstopped by the Government. Concerns have been raised about the risk of fraud, and one estimate found defaults and fraud in the Bounce Back programme for small businesses could reach 80% in the worst case. 

Why?  

Institutions and governments need rapid access to high-quality data to inform decision-making processes. The data must be complete, accurate, up to date and obtained in a timely fashion. These data are generated at speed and volume, with value achieved from integration. This is often a tricky and time-consuming process. Furthermore, the processes used to perform it are often fragmented, ad hoc, non-systematic, brittle, and difficult to reproduce and maintain. 

What?  

The Rapid Match project addressed the challenges around data quality and matching at scale through a systematic process which joins large amounts of messy, incomplete data in varying formats, from multiple sources. We provide a reliable ‘match engine’ allowing government and organisations to accurately and securely integrate diverse sources of data.   
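As an illustrative sketch of what such a match engine does at its core (standard library only; this is not ĢƵ’s actual engine), company names can be normalised and then compared with a similarity score, accepting only matches above a threshold:

```python
from difflib import SequenceMatcher

def normalise(name):
    """Light cleansing before matching: case, punctuation, common suffixes."""
    name = name.lower().replace(".", "").replace(",", "")
    for suffix in (" limited", " ltd", " plc"):
        name = name.removesuffix(suffix)
    return " ".join(name.split())

def best_match(candidate, register, threshold=0.85):
    """Return the register entry most similar to `candidate`, if similar enough."""
    scored = [(SequenceMatcher(None, normalise(candidate), normalise(r)).ratio(), r)
              for r in register]
    score, entry = max(scored)
    return entry if score >= threshold else None

register = ["ACME Trading Limited", "Beta Holdings PLC", "Gamma Analytics Ltd"]
match = best_match("Acme Trading Ltd.", register)  # "ACME Trading Limited"
```

Real matching at scale adds blocking, multiple comparison functions and ML-assisted scoring, but normalise-then-compare is the essential shape of the pipeline.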

A key outcome of the project has been the data quality work applied to the UK Companies House datasets. Companies House data is applied to a wide range of applications, from providing a register of incorporated UK companies through to the KYC onboarding and AML checks performed by institutions. It is estimated that “millions of professionals use Companies House data daily”, for example in due diligence to verify ultimate beneficial ownership, through to matching against financial crime and terrorism lists.

What to do next 

If you are considering how to approach your data matching strategies and would like to view the work we carried out, please get in touch on LinkedIn.

And for more from ĢƵ, find us on our social channels.

The post ĢƵ demonstrates rapid matching capabilities on open datasets appeared first on ĢƵ.

]]>
Artificial Intelligence can help businesses thrive /blog/ai-ml/artificial-intelligence-can-help-businesses-thrive/ Thu, 02 Dec 2021 17:10:52 +0000 /?p=17252 The coronavirus pandemic produced challenges not one of us could have expected. While some sense of normality is returning, many businesses still face an uphill battle to recover. Artificial Intelligence Technology, however, presents a solution for firms hoping to thrive once again.  Artificial Intelligence (AI) is being used for predictive tasks from fraud detection through […]

The post Artificial Intelligence can help businesses thrive appeared first on ĢƵ.

]]>

The coronavirus pandemic produced challenges not one of us could have expected. While some sense of normality is returning, many businesses still face an uphill battle to recover. Artificial Intelligence Technology, however, presents a solution for firms hoping to thrive once again. 

Artificial Intelligence (AI) is being used for predictive tasks from fraud detection through to medical analytics. A key component of AI is the underlying data. Data impacts predictions, scalability and fairness of AI systems. As we move towards data-centric AI, having good quality, fair, representative, reliable and complete data will provide firms with a strong foundation to undertake tasks such as decision making and knowledge to strengthen their competitive position. In fact, AI solutions can be used to improve data quality when applied to tasks such as data labelling, accuracy, consistency, and completeness of data.

AI can help businesses not only improve and integrate data; it will also help their business grow through cost reduction and profit enhancement by reducing manual tasks. Gartner has predicted that the business value created by AI will reach $3.9 trillion in 2022.

Businesses thrive with AI. It can automate financial forecasting, giving them greater visibility of their future finances and in turn empowering business owners to make better decisions and take actions to achieve their ultimate goals.

A key challenge for organisations is understanding the business objectives of deploying AI solutions: moving away from using AI for technology’s sake towards an awareness of what is feasible and of how AI can be harnessed to address those objectives. This remains a significant stumbling block for businesses in understanding the benefits AI can bring to their organisation.

The perceived lack of access to technology, and the assumed need for copious amounts of data to train machine learning models, are other stumbling blocks. We must bust the myth that AI is hard to access: open-source projects such as TensorFlow, through to Microsoft Azure ML and Amazon SageMaker, are simplifying the process of building, deploying and monitoring machine learning models in production. Most companies don’t know this, or how to take advantage of AI’s cost-effective nature.

Even though accessing the technology is easy, using it is less so. Vendors are investing heavily in making the technology more accessible to non-expert users and have made great strides overall.

That is why the upcoming AI Con Conference on 3 December at Titanic Belfast is so important. It gives us the perfect opportunity to discuss the benefits of AI for local firms.

Bringing together business leaders with world-leading technology professionals, AI Con will examine how artificial intelligence is changing our world and the opportunities and challenges it presents.

The themes for this year’s conference, which hosted 450 attendees in its first year and 800 in a virtual format last year, include Applied AI, AI Next and the Business of AI. These are designed for a general audience, tech audience and business audience respectively, and encompass everything from how AI can add value to organisations to what start-ups in the space should know.

The importance of AI cannot be disputed. AI Con will provide us with an opportunity to showcase the very best of AI. With Belfast now being a recognised tech hub, AI Con provides the perfect opportunity to foster debate and discussion around the benefits AI provides for business. Engagement with key business leaders and organisations is an essential part of that.

To find out more information about this year’s AI Con, visit the AI Con website.

And for more from ĢƵ, find us on our social channels.

The post Artificial Intelligence can help businesses thrive appeared first on ĢƵ.

]]>
Rules Suggestion – What is it and how can it help in the pursuit of improving data quality? /blog/ai-ml/rules-suggestion-what-is-it-and-how-can-it-help-improve-data-quality/ Wed, 15 Sep 2021 09:06:21 +0000 /?p=15573 Written by Daniel Browne, Machine Learning Engineer Defining data quality rules and collection of rules for data quality projects is often a manual time-consuming process. It often involves a subject matter expert reviewing data sources and designing quality rules to ensure the data complies with integrity, accuracy and / or regulatory standards. As data sources […]

The post Rules Suggestion – What is it and how can it help in the pursuit of improving data quality? appeared first on ĢƵ.

]]>
Written by Daniel Browne, Machine Learning Engineer

Defining data quality rules and collections of rules for data quality projects is often a manual, time-consuming process. It typically involves a subject matter expert reviewing data sources and designing quality rules to ensure the data complies with integrity, accuracy and/or regulatory standards. As data sources increase in volume and variety, with potential functional dependencies, the task of defining data quality rules becomes more difficult. Machine learning can aid this task by identifying dependencies between datasets, uncovering patterns related to data quality, and suggesting previously applied rules for similar data.

At ĢƵ, we recently undertook a Rule Suggestion Project to automate the process of defining data quality rules for datasets through rule suggestions. We use natural language processing techniques to analyse the contents of a dataset and suggest rules in our rule library that best fit each column.  

Problem Area and ML Solution  

Generally, there are several data quality and data cleansing rules that you would typically want to apply to certain fields in a dataset. An example is a consistency check on a phone number column in a dataset such as checking that the number provided is valid and formatted correctly. Unfortunately, it is not usually as simple as searching for the phrase “phone number” in a column header and going from there. A phone number column could be labelled “mobile”, or “contact”, or “tel”, for example. Doing a string match in these cases may not uncover accurate rule suggestions. We need context embedded into this process and this is where machine learning comes in. We’ve been experimenting with building and training machine learning models to be able to categorise data, then return suggestions for useful data quality and data cleansing rules to consider applying to datasets.  
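The content-based idea can be sketched with a simple pattern check: rather than trusting the header, inspect the values themselves. A toy example (the regex is deliberately loose and illustrative, and much simpler than a trained model):

```python
import re

PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,15}$")

def looks_like_phone_column(values, min_fraction=0.8):
    """Content-based check: the column is a candidate for phone-number rules
    when most of its non-empty values match a loose phone pattern,
    whatever the header happens to be called."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return False
    hits = sum(bool(PHONE_RE.match(v)) for v in non_empty)
    return hits / len(non_empty) >= min_fraction

column = ["+44 28 9099 1234", "07700 900123", "(028) 9018 5055", ""]
looks_like_phone_column(column)            # True, despite any ambiguous header
looks_like_phone_column(["Alice", "Bob"])  # False
```

An ML classifier generalises this to categories where no single regex exists, using both the header text and the content as features.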

Human in the Loop  

The goal here is not to take away control from the user, the machine learning model isn’t going to run off with your dataset and do what it determines to be right on its own – the aim is to assist the user and to streamline the selection of rules to apply. A user will have full control to accept or reject some or all suggestions that come from the Rule Suggestion model. Users can add new rules not suggested by the model and this information is captured to improve the suggestions by the model. We hope that this will be a useful tool for users to make the process of setting up data quality and data cleansing rules quicker and easier.  

Developers View  

I’ve been involved in the development of this project from the early stages, and it’s been exciting to see it come together and take shape over the course of the project’s development. A lot of my involvement has been around building out the systems and infrastructure to help users interact with the model and to format the model’s outputs into easily understandable and useful pieces of information. This work involves allowing the software to take a dataset and process it so that the model can make its predictions on it, and then mapping from the model’s output to the individual rules that will then be presented to the user.

One of the major focuses we’ve had throughout the development of the project is control. We’ve been sure to build out the project with this in mind, with features such as giving users control over how cautious the model should be in making suggestions by being able to set confidence thresholds for suggestions, meaning the model will only return suggestions that meet or surpass the chosen threshold. We’ve also included the ability to add specific word-to-rule mappings that can help maintain a higher level of consistency and accuracy in results for very specific or rare categories that the model may have little or no prior knowledge of. For example, if there are proprietary fields that may have their own unique label, formatting, patterns or structures, and their own unique rules related to that, it’s possible to define a direct mapping from that to rules so that the Rule Suggestion system can produce accurate suggestions for any instances of that information in a dataset in the future.  
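The threshold and word-to-rule override behaviour described above could be sketched like this (data structures, rule names and thresholds are hypothetical, not the product’s API):

```python
def filter_suggestions(suggestions, threshold=0.7, overrides=None):
    """Keep model suggestions at or above the user-set confidence threshold,
    then layer explicit word-to-rule mappings on top (overrides always apply).
    `suggestions` maps column name -> list of (rule, confidence) pairs."""
    overrides = overrides or {}
    result = {}
    for column, ranked in suggestions.items():
        kept = [rule for rule, conf in ranked if conf >= threshold]
        for word, rule in overrides.items():
            if word in column.lower() and rule not in kept:
                kept.append(rule)
        result[column] = kept
    return result

model_out = {"contact_tel": [("valid_phone", 0.82), ("valid_email", 0.30)],
             "acct_ref": [("not_null", 0.55)]}
final = filter_suggestions(model_out, overrides={"acct": "internal_account_format"})
# {'contact_tel': ['valid_phone'], 'acct_ref': ['internal_account_format']}
```

The override path is what keeps proprietary fields with unique formats producing accurate suggestions even when the model has never seen them.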

Another focus of the project we hope to develop further upon is the idea of consistently improving results as the project matures. In the future we’re looking to develop a system where the model can continue to adapt based on how the suggested rules are used. Ideally, this will mean that if the model tends to incorrectly predict that a specific rule or rules will be useful for a given dataset column, it will begin to learn to avoid suggesting that rule for that column based on the fact that users tend to disagree with that suggestion. Similarly, if there are rules that the model tends to avoid suggesting for a certain column that users then manually select, the model will learn to suggest these rules in similar cases in the future.  

In the same vein, one of the recent developments that I’ve found really interesting and exciting is a system that allows us to analyse the performance of various machine learning models on a suite of sample data. This gives us detailed insights into what makes an efficient and powerful rule prediction model, and into how we can expect models to perform in real-world scenarios. It provides a sandbox to experiment with new ways of creating and updating machine learning models and to estimate baseline standards for performance, so we can be confident of the level of performance of our system. It’s been really rewarding to analyse the results from this process so far, to compare the different methods of processing the data and building machine learning models, and to see in which areas one model may outperform another.

Thanks to Daniel for talking to us about rules suggestion. If you would like to discuss further or find out more about rules suggestion at ĢƵ, reach out to our Head of AI directly.

Get in touch or find us on our social channels.

The post Rules Suggestion – What is it and how can it help in the pursuit of improving data quality? appeared first on ĢƵ.

]]>
ĢƵ is involved with the KTN: AI for Services UK Tour! /blog/marketing-insights/ktn-ai-for-services-on-tour-2/ Tue, 23 Feb 2021 11:30:00 +0000 /?p=14015 The first stop on the AI for Services UK Tour will be Northern Ireland curated by the fantastic team at Invest Northern Ireland and Innovate UK! We are delighted thatĢƵwill be one of the companies involved, the aim of the event is to discover the innovation taking place across the UK in the professional and […]

The post ĢƵ is involved with the KTN: AI for Services UK Tour! appeared first on ĢƵ.

]]>
The first stop on the AI for Services UK Tour will be Northern Ireland, curated by the fantastic teams at Invest Northern Ireland and Innovate UK!
AI for Services

We are delighted that ĢƵ will be one of the companies involved. The aim of the event is to discover the innovation taking place across the UK in the professional and financial services, insurance, accountancy and law sectors.

Kainos, Adoreboard and Analytics Engines are among the other companies also representing Northern Ireland on the AI for Services Tour. ĢƵ Head of AI, Dr Fiona Browne, will be pitching at the event. We thought it would be a good idea to catch up with Dr Browne ahead of the event to find out what it’s all about!

Hi Fiona! Could you tell me more about the event and why ĢƵ is involved?

The AI for Services event is a UK-wide event hosted by KTN and Innovate UK, and we are part of the NI cohort. The event is a roadshow, which will provide the opportunity for companies from all the different regions to highlight what they are doing in terms of innovation and AI, and how these can address areas within the various sectors. The roadshow will also allow each of the companies to pitch to organisations in different sectors including Accountancy, Insurance and Financial Services.

Fiona, you will be giving one of these pitches at the event. What can you tell us about it?

All the regions have a chance to provide a 7-minute pitch. We will be describing who ĢƵ are and what we specialise in (Data Quality and Matching). We will be focusing on a particular use case, which is related to Onboarding and the role of entity matching within this process, highlighting the recent work we have done in this area. We will be highlighting the data quality required before the matching process occurs, but also how we have augmented our matching process with machine learning.

If you could pick one key takeaway that you would want people to get from the pitch, what would it be?

I think the key message to take away is that Machine Learning (ML) has a role to play in addressing manual, time-consuming tasks, and when applied to the correct applications it can make efficiency savings. However, good ML is built on quality data, and effort is needed to ensure that you have a reproducible data quality pipeline in place. At ĢƵ we pride ourselves on our data quality and matching technology and have innovated in these areas. We are really excited about the developments we are making, and we can’t wait to tell you more!

ĢƵ will be representing NI. Do you think that the talent here locally and the technological developments are matching up to the rest of the UK?

Yes! There’s a real focus on Artificial Intelligence and FinTech within NI. The country may be small in size, but in terms of capabilities it offers great solutions.

What do you hope will be the biggest takeaway for attendees from the whole event?

The idea of this event is for companies within sectors such as finance, insurance, law and accountancy, who are embarking on or partway through their digital transformation journey, to connect with companies that offer innovative solutions. At ĢƵ we want to better understand the bottlenecks and pain points that companies in these sectors are facing and offer a solution that addresses them. We hope to deepen our specialist knowledge of the current challenges in the industry so that we can tailor our technology to solve real business problems. We will showcase our self-service data quality and matching solutions, highlighting the continual developments we have made with machine learning to augment the matching process.

It is also a great opportunity to extend our presence, as we are primarily linked to the financial and governmental sectors. Accountancy, Law and Insurance are sectors that we haven’t traditionally marketed to, but they have similar areas to address, such as compliance with regulation and common data management challenges.

What would you like the audience to share?

We will highlight what our solution is and what we do, but we want to understand the pain points better. Where do the difficulties lie? Is it extracting knowledge from textual sources of information? Or is it issues with integrating different data sources? Or with adhering to regulations? It will be good to hear first-hand from these organisations.

Are you looking forward to hearing any particular pitch on the day?

I am looking forward to hearing them all. Particularly because all the companies are very different, it’ll be interesting to hear more about their solutions and the innovations that they are offering.

How can attendees be able to get in touch with you?

You can register as a delegate to hear the presentations. Then, Innovate UK is using a platform called Meeting, where 1:1 meetings with companies can be booked between 12:30 and 2 pm.

The event is sure to be a good one, and we are excited to be involved. We are most excited to learn more about the different sectors! Keep an eye on the KTN social media pages for updates on the event. KTN also has an events archive where you can listen to past events if you have missed them.

Visit our website for more from ĢƵ, or find us on social media for the latest news.

The post ĢƵ is involved with the KTN: AI for Services UK Tour! appeared first on ĢƵ.

]]>
AI Con 2020 Interview with Dr. Fiona Browne and Matt Flenley /blog/marketing-insights/ai-con-2020-interview-with-dr-fiona-browne-and-matt-flenley/ Wed, 02 Dec 2020 12:00:36 +0000 /?p=13102 Dr. Fiona Browne, Head of AI, and Matt Flenley, Marketing and Partnerships Manager at ĢƵ are contributing to AI Con 2020 this year.    After a successful first year, AI Con is back! This year it’s said to be bigger and better than ever with a range of talks across AI, including AI/ML in Fintech; AI in the public sector; the impact of arts; the impact of […]

The post AI Con 2020 Interview with Dr. Fiona Browne and Matt Flenley appeared first on ĢƵ.

]]>
Dr. Fiona Browne, Head of AI, and Matt Flenley, Marketing and Partnerships Manager at ĢƵ are contributing to AI Con 2020 this year.   
AI CON

After a successful first year, AI Con is back!

This year it’s said to be bigger and better than ever with a range of talks across AI, including AI/ML in Fintech; AI in the public sector; the impact of arts; the impact of AI on research and innovation; and how AI has caused a change in the screening industry. All these topics will be tackled by world-leading technology professionals and business leaders to unpack how AI is changing our world.  

Ahead of AI Con 2020, taking place virtually on the 3rd and 4th December, we thought it would be a good idea to sit down with two of those industry experts, Fiona and Matt, and ask them a few things. I wanted to understand what their involvement with AI Con is this year, any previous involvement they’ve had with the conference, what they envisage to be the key takeaways, and of course, what talks they are most looking forward to engaging with themselves.

Hi, Fiona and Matt. Perhaps to kick off, you could talk a bit about why you both wanted to be involved with AI Con?

Fiona: Hello! Well, we were involved with it last year and it was a great experience. We were involved in the session that focused on business and the applications of AI. We were asked then to pull a session together for this year, and we’ve been able to focus on the area that ĢƵ specialises in, which is Financial Services. 

This has given us the chance to unpack how machine learning can be used in Financial Services; we’ve tried to cover three broad areas within this session:  firstly, understanding those people who work in the financial institutions. Secondly, we will then delve into our bread-and-butter data quality & matching, and lastly the importance of data governance.  

Matt: Hi! Last year I worked with Fiona to arrange our involvement. This year we had more time to prepare, which meant that Fiona and I could collaborate even more closely.

I particularly enjoyed approaching speakers such as Peggy and Sarah (to name but a few!). What interests me most is the application of AI and we are delighted to have contributed towards pulling together such a strong line-up.

The variety of talks will also attract a wide range of attendees!

This is the second year. Perhaps you both could talk to me about your previous involvement with AI Con, if any, and how it has evolved?  

Fiona: Last year we discovered there was a significant appetite for this content. This year we have been able to expand the conference over more streams by being more strategic with the messaging. We have also been able to create a session of our own, on an area we know well and are hugely passionate and experienced in. This year the conference is not just local; it’s much more international. If you look at the line-up of speakers for our session alone, they come from New York and Switzerland.

The international flavour offers greater perspective, knowledge, and insight.

Matt: I agree. I’ve been blown away by how engaged people have been. We have Andrew Jenkins, the Fintech Envoy for Northern Ireland and Gary Davidson of Tech Nation, who are keen to contribute to where they think the market is going.

The panel I am chairing looks at FinTechs that are scaling and exporting, with a focus on why people should invest in NI technology. The event is well prepared and timely, and I am looking forward to chairing on Thursday.

So, Matt what will the panel you are chairing be discussing, who is on the panel?  

Matt: We are joined by Pauline Timoney, COO of Automated Intelligence; Chris Gregg, CEO and Founder of Light Year; and, as I mentioned before, Andrew Jenkins and Gary Davidson. We are going to look at the opportunities to collaborate with incubators like Tech Nation, the impact of COVID-19 and Brexit, and last year’s FinTech investments.

FinTech is a rapidly growing sector, and we are excited to delve into why and explore where the sector is heading next!

Fiona, you have been one of the curators of AI Con, how has that process been?  

Fiona: It has been great! We were given the remit of FinTech and we could pick and choose what topics and who we wanted to add to the line-up. We have a very clear message. The talks are practical application-centred with a focus on trends and experience.

One of the largest wealth management companies in the world is coming to discuss their usage of technology, future projections, and more!

What do you both envisage the biggest takeaways of AI Con being?  

Matt: One of the biggest takeaways is going to be the incredible, thriving NI FinTech sector.

When you look around the ecosystem, you can see the sheer explosion of firms and the problems being solved.

Fiona: There will be maturity across the board, with more companies implementing these technologies.

People are increasingly thinking about Machine Learning and AI… how can we use it?

I believe there will be a skillset gap; it will be a challenge for many firms to attract the talent that can implement these processes and technologies.

To wrap up! On a personal note, what talk(s) are you both most looking forward to?

Matt: I am excited to hear from Sarah Gadd of Credit Suisse. Her wealth of experience will offer great insight into how they apply AI in reality. Not only are they at the cutting edge of technology, but they have taken it off the ground. I am also looking forward to Peggy Tsai’s contribution.

Fiona: From our side, Sarah and Peggy will be interesting. It’s an honour to have a speaker like Sarah Gadd. It’s brilliant to hear how they are applying this technology now in a regulated area. What are their challenges and solutions? Also, Peggy is giving time to the complexity of data, which is more important than ever before. Austin, too, will be unpacking AI in the arts and music sector. I am looking forward to the overall variety, calibre, and diversity of points of view on offer.

Thank you both for taking the time out of your schedules! If you haven’t reserved your place for AI Con 2020, there is no time like the present! You can secure your place for free. It will be a brilliant conference. Who’s ready to learn more about AI?

The post AI Con 2020 Interview with Dr. Fiona Browne and Matt Flenley appeared first on ĢƵ.

]]>
How can banks arm themselves against increasing regulatory and technological complexity? – FinTech Finance /blog/ai-ml/2020-the-year-of-aml-crisis/ Tue, 03 Nov 2020 10:00:22 +0000 /?p=12885 ĢƵ Head of Artificial Intelligence, Dr. Fiona Browne, recently contributed to the episode of FinTech Finance: Virtual Arena. Steered by Douglas MacKenzie, the interview covered the extent of the Anti-Money Laundering (AML) fines currently faced by banks over the last number of years and start to unpack what we do at ĢƵ in relation to […]

The post How can banks arm themselves against increasing regulatory and technological complexity? – FinTech Finance appeared first on ĢƵ.

]]>
Image of Fiona Browne

ĢƵ Head of Artificial Intelligence, Dr. Fiona Browne, recently contributed to an episode of FinTech Finance: Virtual Arena. Steered by Douglas MacKenzie, the interview covered the extent of the Anti-Money Laundering (AML) fines faced by banks over recent years and began to unpack what we do at ĢƵ in relation to this topic: helping banks address their data quality, with solutions designed to combat fraudsters and money launderers.

How can banks arm themselves against increasing regulatory and technological complexity?

Fiona began by highlighting how financial institutions face significant challenges when managing their data. With the increase in financial regulation since the financial crisis of 2008/2009, ensuring data quality has grown in importance, obliging institutions to have a handle on their data and make sure it is up to date. Modern data quality platforms mean that the timeliness of data can now be checked via a ‘pulse check’ to ensure that it can be used in further downstream processes and that it meets regulations.
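A timeliness ‘pulse check’ of this kind can be as simple as flagging records whose last-updated timestamp has drifted past an acceptable age. The sketch below is illustrative only: the record shape, field names, and 30-day freshness window are assumptions, not the behaviour of any particular platform.

```python
from datetime import datetime, timedelta, timezone

def pulse_check(records, now, max_age=timedelta(days=30)):
    """Return the IDs of records whose last update is older than max_age.

    `records` is a list of (record_id, last_updated) pairs; stale records
    would be refreshed before being used in downstream processes.
    """
    return [rid for rid, updated in records if now - updated > max_age]

# Hypothetical customer records with last-updated timestamps
now = datetime(2020, 11, 1, tzinfo=timezone.utc)
records = [
    ("cust-1", datetime(2020, 10, 30, tzinfo=timezone.utc)),  # 2 days old
    ("cust-2", datetime(2020, 8, 1, tzinfo=timezone.utc)),    # ~3 months old
]
stale = pulse_check(records, now)  # ["cust-2"]
```

In practice the acceptable age would vary by data type: sanctions-list data may need refreshing daily, while registered addresses can tolerate a longer window.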

Where does ĢƵ fit in to the AML arena? 

A financial institution needs to be able to verify the client they are working with when going through AML checks. The AML process itself is vast, but at ĢƵ we focus on the area of profiling, data quality and matching – it is our bread and butter. Fiona stressed the importance of internal checks as well as public entity data, such as sanctions and watch lists.

In a nutshell, there is a significant amount of data to check and compare, and with a lack of quality data it becomes a difficult and costly task to perform. That is why we at ĢƵ focus on data quality cleansing and matching at scale.

Why should banks look to partner, rather than building it in house? 

One of the key issues with doing this in house is not having the necessary resources to perform the required checks and adhere to the different processes in the AML pipeline. According to the Financial Conduct Authority (FCA), in-house checks and a lack of data are causing leading financial institutions to receive hefty fines. Fiona reiterated that when banks bring it back to fundamentals, getting their processes right and their data in order, they can then use a partner’s technology to automate and streamline these processes, which in turn speeds up onboarding and ensures the legislation is being met.

Why did the period of 2018/2019 have such a high number of AML breaches?

Fiona explained that many transactions go back over a decade, and it takes time to identify them. AML compliance is difficult to achieve, and regulators know it is challenging. Regulators are doing a better job of providing guidelines to financial institutions, enabling them to address these regulations. Fiona reaffirmed that 2018/2019 was perhaps a much-needed wake-up call on this issue.

And with AML fines already at $5.6 billion this year, more than in the whole of 2019, what can banks do?

Looking at the US, where, although fines for non-compliant AML processes are not as high as in 2019, a substantial number of fines are still being issued, Fiona said that it is paramount to ensure financial institutions have the right data and the right processes in place. Although it can be considered an administrative burden, there is real criminal activity behind the scenes, which is why AML is so important. It is vital that financial institutions get a handle on this, enabling them also to improve the experience for their clients.

The fines will continue to be issued. Why should firms look to clean data when they just want to get to the bottom line? 

It is essential to have the building blocks in place. Data quality is key for the onboarding process, but it is also essential downstream, particularly if you want to do more trend analysis. Getting the fundamentals right at the start will pay dividends.

Are there any other influences that Artificial Intelligence (AI) and Machine Learning (ML) can have on a bank’s onboarding process?

According to Fiona, there is no silver bullet: one AI/ML technique will not solve all AML issues. It is about deploying these techniques to approach the issues in different ways. A large part of the onboarding process is gathering data and extracting relevant information from the dataset, and Fiona has seen many Natural Language Processing (NLP) techniques employed to extract data from documents. At ĢƵ, we use machine learning in the data matching process to reduce manual review time. ML techniques are employed in both supervised and unsupervised approaches geared towards pinpointing fraudulent transactions. We think the graph database and network analysis side of machine learning is an interesting area, and we are currently exploring how it can be deployed in AML and fraud detection.

Bonus content: In the US and Canada, one way to potentially identify fraud was to look at transactions over $10,000. Criminals, however, have become increasingly savvy, even utilising machine learning to muddy their tracks. By dividing transactions into randomised amounts, they can make them appear less conspicuous. As Fiona put it, it’s ‘a cat and mouse game’.
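The splitting pattern Fiona describes can be caught with a simple heuristic: look for accounts whose individual payments each stay just under the reporting threshold but whose total crosses it. The sketch below is illustrative only; the thresholds, the 90% window, and the data shapes are assumptions, not any institution’s actual detection logic.

```python
from collections import defaultdict

def flag_structuring(transactions, threshold=10_000, window=0.9):
    """Flag accounts with two or more transactions just under the reporting
    threshold whose combined value exceeds it: a classic sign of one large
    transfer split into smaller, less conspicuous amounts."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for account, amount in transactions:
        if window * threshold <= amount < threshold:  # just under the limit
            totals[account] += amount
            counts[account] += 1
    return {acc for acc in totals
            if counts[acc] >= 2 and totals[acc] >= threshold}

txns = [("A", 9_500), ("A", 9_800), ("B", 12_000), ("C", 9_900), ("D", 500)]
flag_structuring(txns)  # {"A"}: two just-under-threshold payments totalling 19,300
```

A real system would combine signals like this with models trained on confirmed cases, since a fixed rule is exactly what savvy criminals learn to evade.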

If you work in the banking sector, or if you have to deal with large and messy datasets, you will probably face challenges arising from poor data quality, standardisation, and siloed information.

ĢƵ provides the tools to tackle these issues with minimum IT overhead, in a powerful and agile way. Get in touch with the self-service data quality experts today to find out how we can help.

The post How can banks arm themselves against increasing regulatory and technological complexity? – FinTech Finance appeared first on ĢƵ.

]]>
EDM Talks: Lifting the lid on the problems that ĢƵ solves /blog/marketing-insights/lifting-the-lid-edm/ Fri, 30 Oct 2020 09:00:00 +0000 /?p=12630 Recently we partnered with the EDM Council on a video that investigates the application of AI to data quality and matching. In this EDM Talk, we lift the lid on how our AI team is developing solutions to help our clients, especially in the area of entity matching and resolution. This plays an important role in on-boarding, KYC and obtaining a single […]

The post EDM Talks: Lifting the lid on the problems that ĢƵ solves appeared first on ĢƵ.

]]>
Recently we partnered with the EDM Council on a video that investigates the application of AI to data quality and matching.

In this EDM Talk, we lift the lid on how our AI team is developing solutions to help our clients, especially in the area of entity matching and resolution. This plays an important role in on-boarding, KYC and obtaining a single customer view.


What is the data challenge? 

Institutions such as banks often have large sets of very messy data, which may be siloed and subject to duplication. When onboarding a new client or building a legal entity master, institutions may need to match clients against both internal datasets and external sources. These include vendors such as Dun and Bradstreet and Bloomberg, or data from a local company registration authority, such as Companies House in the UK. This data needs to be cleaned, normalised and matched to create a single golden record, in order to verify identity and adhere to regulatory compliance. For many institutions, this can be a heavily manual and time-consuming process.

What needs to be done to improve entity matching? 

In entity resolution, there are two main challenges to address: the data matching side, and the manual remediation required to resolve instances where we have low-confidence, mismatched or unmatched entities.

ĢƵ recently undertook a use case exploring the matching of entities between two open global entity datasets, Refinitiv ID and Global LEI. We augmented our rule-based fuzzy matching approach with ML to improve efficiency around the manual remediation of low-confidence matches. We performed matching between these datasets using deterministic rules, as many firms do today, and followed the standard approach in place for many onboarding teams, whereby low-confidence entity matches go into manual review. Within ĢƵ, data engineers were timed to measure the average time taken to remediate a low-confidence match, which could take up to a minute and a half per entity pair. This might be fine if there are just a few entities to check, but when there are hundreds, thousands, or hundreds of thousands, it highlights how challenging the task becomes and the resource and time required to commit to it.
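To make the rule-based side of this concrete, here is a minimal sketch of fuzzy name matching with confidence-based routing, using only Python’s standard library. The normalisation rules and thresholds are illustrative assumptions, not the actual ĢƵ matching rules.

```python
from difflib import SequenceMatcher

def normalise(name: str) -> str:
    """Crude normalisation: lower-case and strip common legal suffixes."""
    name = name.lower().strip()
    for suffix in (" ltd", " limited", " plc", " inc"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name.strip()

def match_confidence(a: str, b: str) -> float:
    """Similarity score between two entity names, in [0, 1]."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def route(a: str, b: str, auto_accept=0.95, auto_reject=0.40) -> str:
    """Route a candidate pair: auto-match, auto-non-match, or manual review."""
    score = match_confidence(a, b)
    if score >= auto_accept:
        return "match"
    if score <= auto_reject:
        return "non-match"
    return "manual review"

route("Acme Ltd", "ACME Limited")         # "match"
route("Acme Holdings Ltd", "Acme Ltd")    # "manual review"
```

Pairs that land in the middle band are exactly the low-confidence matches that would queue up for the minute-and-a-half-per-pair manual review described above.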

At ĢƵ we thought this was an interesting problem to explore. We were keen to fully understand whether AI-enabled data quality and matching would bring benefits, in terms of efficiency and improved data quality, to our clients who undertake such tasks.

What did ĢƵ want to achieve? 

We were particularly interested to understand how we could reduce manual effort and increase the accuracy of data matching. We wanted to understand what benefits machine learning would bring to the process, using an approach that was transparent and which would make decision-making open and obvious to an auditor or regulator.

What benefit is there from applying Machine Learning to this problem? 

Machine learning is a broad domain, covering application areas from speech recognition and language understanding to process automation and decision making. Machine learning approaches are built on mathematical algorithms and statistical models. The advantage of these approaches is the ability of the algorithms to learn from data, uncover patterns, and then use this learning to make predictions on new, unseen cases. We see machine learning deployed in everyday life, from our email filters through to personal assistant devices such as Amazon Echo and Apple Siri.

Within the financial sector, machine learning techniques are being applied to tasks including profiling behaviour for fraud detection; using natural language processing to extract information from unstructured text to enrich the Know Your Customer onboarding process; and using chatbots to automatically address customer queries and customise product offerings.

At ĢƵ we view machine learning both as a tool to automate manual tasks and as a decision-making aid, augmenting processes such as matching, error detection, and data quality rule suggestion for our clients. This frees up time and resource, enabling clients to do more in their role.

How can machine learning be applied to the process of matching? 

Within ĢƵ we have augmented our rules-based matching process with machine learning. Our solution focuses on explainability and transparency, enabling us to trace why and how predictions have been made. This transparency is important to financial clients, both for adhering to regulations and for building trust in the system providing the predictions. Using high-confidence predictions, we can automate a large volume of manual review. For example, in the matching use case we were able to reduce the manual review burden by 45%, freeing up clients’ time so that expertise can be focused on the difficult edge cases.

At ĢƵ we train machine learning models using examples of matches and non-matches. Over time, patterns within that data are detected, and this learning can be used to make predictions on new, unseen cases. A reviewer can validate the predictions and feed the results back into the algorithm; this is known as human-in-the-loop machine learning. Eventually the algorithm becomes smarter, making more accurate predictions. High-quality predictions can lead to less manual review by reducing the volume of cases that need to be reviewed.
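The confidence-based automation described above can be sketched as a simple triage step over model scores. The scores and the 0.9 threshold below are purely illustrative; the 45% reduction reported in the use case came from the real models and data, not this toy.

```python
def triage(predictions, threshold=0.9):
    """Split scored candidate pairs into auto-decided and manual-review queues.

    `predictions` is a list of (pair_id, match_probability); probabilities
    close to 1 or 0 are confident match / non-match decisions, and only
    the uncertain middle band goes to a human reviewer.
    """
    auto, review = [], []
    for pair_id, p in predictions:
        if p >= threshold or p <= 1 - threshold:
            auto.append((pair_id, p >= threshold))  # True = match
        else:
            review.append(pair_id)
    return auto, review

# Hypothetical model outputs for five candidate entity pairs
scored = [("p1", 0.98), ("p2", 0.03), ("p3", 0.55), ("p4", 0.91), ("p5", 0.42)]
auto, review = triage(scored)
reduction = len(auto) / len(scored)  # share of pairs no longer needing review
```

Raising the threshold trades automation for safety: fewer pairs are auto-decided, but those that are carry higher confidence, which matters when the decisions must stand up to an auditor.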

The models we have built need good quality data. We used the ĢƵ self-service data quality platform to create good quality datasets and apply labels to that data. Moving forward, we are seeking to augment our AI and look at graph linkage analysis, as well as further enhancing our feature engineering and dataset capabilities.

To learn more about the work we are doing with machine learning and how we are applying it to the ĢƵ platform, all content is available on the ĢƵ website. We also have a whitepaper on AI-enabled data quality.


For a demo of the system in action please fill out the contact form. 

To find out more about what we do at ĢƵ, check out the full EDM talks video ! 

We will soon be publishing Part 2 of this blog series that will look at the application of AI and ML in the Fintech sector in more detail as well as an entity resolution use case.  


The post EDM Talks: Lifting the lid on the problems that ĢƵ solves appeared first on ĢƵ.

]]>
IRMAC Reflections with Dr. Fiona Browne /blog/ai-ml/irmac-reflections-with-dr-fiona-browne/ Mon, 07 Sep 2020 09:00:00 +0000 /?p=11379 There is a lot of anticipation surrounding Artificial Intelligence (Al) and Machine Learning (ML) in the media. Alongside the anticipation is speculation – including many articles placing fear into people by inferring that AI and ML will replace our jobs and automate our entire lives! Dr Fiona Browne, Head of AI at ĢƵ recently spoke at an IRMAC (Information […]

The post IRMAC Reflections with Dr. Fiona Browne appeared first on ĢƵ.

]]>
There is a lot of anticipation surrounding Artificial Intelligence (AI) and Machine Learning (ML) in the media. Alongside the anticipation is speculation, including many articles stoking fear by suggesting that AI and ML will replace our jobs and automate our entire lives!

Dr. Fiona Browne, Head of AI at ĢƵ, recently spoke at an IRMAC (Information Resource Management Association of Canada) webinar, alongside Roger Vandomme of Neos, to unpack what AI/ML is, some of the preconceptions, and the reasons why different approaches to ML are taken…

IRMAC reflections with Dr Browne

What is AI/ ML? 

Dr. Browne clarified that while there is no officially agreed-upon definition of AI, it can be described as the ability of a computer to perform cognitive tasks, such as voice/speech recognition, decision making, or visual perception. ML is a subset of AI, comprising different algorithms that learn from input data.

A point that Roger brought up at IRMAC was that the algorithms learn to identify patterns within the data, and these patterns enable the model to distinguish between different outcomes, for example, detecting a fraudulent versus a non-fraudulent transaction.

ML takes processes that are repetitive and automates them. At ĢƵ, we are exploring the usage of AI and ML in our platform capabilities – Dr Fiona Browne

What are the different approaches to ML?  

Dr. Browne communicated that, at a broad level, there are three approaches: supervised, unsupervised, and reinforcement machine learning.

In supervised ML, the model learns from a labelled training dataset. For example, financial transactions labelled as either fraudulent or genuine would be fed into the ML model. The model then learns from this input and can distinguish between the two.
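As a toy illustration of learning from labelled examples, the sketch below ‘trains’ on transaction amounts with fraud/genuine labels and classifies a new amount by majority vote among its nearest labelled neighbours. The amounts and the single-feature nearest-neighbour approach are purely illustrative assumptions, not the models discussed in the webinar.

```python
def knn_predict(train, x, k=3):
    """One-feature k-nearest-neighbour classifier: label a new transaction
    by majority vote among the k labelled examples closest in amount."""
    nearest = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

# Labelled training data: (transaction amount, label)
train = [(12, "genuine"), (25, "genuine"), (30, "genuine"),
         (9_400, "fraud"), (9_700, "fraud"), (9_900, "fraud")]

knn_predict(train, 9_600)  # "fraud": its nearest neighbours are all labelled fraud
knn_predict(train, 20)     # "genuine"
```

Real fraud models use many features and far more data, but the principle is the same: the labels in the training set are what allow the model to separate the two outcomes.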

Where data is unlabelled, Dr. Browne explained that unsupervised ML is more appropriate: the model learns directly from the unlabelled data. The key difference from supervised ML is that the model seeks to uncover clusters or patterns inherent in the data, enabling it to separate them out.
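The cluster-uncovering idea can be illustrated with a minimal one-dimensional two-means clustering sketch: no labels are provided, yet the algorithm still separates the data into groups. The amounts and the two-cluster simplification are illustrative assumptions.

```python
def two_means(amounts, iters=20):
    """Minimal 1-D 2-means clustering: repeatedly assign each point to the
    nearer of two centroids, then move each centroid to its group's mean."""
    centroids = [min(amounts), max(amounts)]  # initialise at the extremes
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        for x in amounts:
            nearer = 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1
            groups[nearer].append(x)
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids, groups

amounts = [10, 20, 30, 9_400, 9_600, 9_800]
centroids, groups = two_means(amounts)
# groups[0] holds the everyday amounts, groups[1] the outlying cluster
```

No transaction was ever labelled, but the structure inherent in the data is enough to separate the two populations, which is exactly what unsupervised approaches exploit.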

Finally, reinforcement machine learning involves models that continually learn and update from performing a task. For example, a computer algorithm learning how to play the game ‘Go’. This is achieved by the outputs of the model being validated and that validation being provided back to the model.  

The difference between supervised learning and reinforcement learning is that in supervised learning the training data comes with the answer key, meaning the model is trained with the correct answers.

In contrast, in reinforcement learning there is no answer key; the reinforcement agent selects what to do to perform the specific task. With no training dataset present, the model must learn from its own experience.

Often the biggest trial comes when a model is transferred out of the training environment and into the real world.

Now that AI/ML and the different approaches have been unpacked, the next question is how explainability fits into this. The next mini IRMAC reflection will unravel what explainability is and what the different approaches are. Stay tuned!

Fiona has written an extensive piece on AI-enabled data quality; feel free to check it out here.


The post IRMAC Reflections with Dr. Fiona Browne appeared first on ĢƵ.

]]>
Read how AI is transforming Data Quality in this exclusive white paper /blog/ai-ml/ai-whitepaper-data-quality/ Wed, 10 Jun 2020 20:00:43 +0000 /ai-enabled-dq/ In this AI whitepaper, authored by our Head of AI Fiona Browne, we provide an overview of Artificial Intelligence (AI) and Machine Learning (ML) and their application to Data Quality. We highlight how tools in the ĢƵ platform can be used for key data preparation tasks including cleansing, feature engineering and dataset labelling for […]

The post Read how AI is transforming Data Quality in this exclusive white paper appeared first on ĢƵ.

]]>

In this AI whitepaper, authored by our Head of AI, we provide an overview of Artificial Intelligence (AI) and Machine Learning (ML) and their application to Data Quality.

We highlight how tools in the ĢƵ platform can be used for key data preparation tasks including cleansing, feature engineering and dataset labelling for input into ML models.

A real-world application of how ML can be used as an aid to improve consistency around manual processes is presented through an Entity Resolution Use Case.

In this case study we show how using ML reduced manual intervention tasks by 45% and improved data consistency within the process.

Having good quality, reliable and complete data provides businesses with a strong foundation to undertake tasks such as decision making and knowledge to strengthen their competitive position. It is estimated that poor data quality can cost an institution on average $15 million annually.

As we continue to move into the era of real-time analytics and Artificial Intelligence (AI) and Machine Learning (ML) the role of quality data will continue to grow. For companies to remain competitive, they must have in place flexible data management practices underpinned by quality data.

AI/ML are being used for predictive tasks from fraud detection through to medical analytics. These techniques can also be used to improve data quality when applied to tasks such as data accuracy, consistency, and completeness of data along with the data management process itself.

In this whitepaper we provide an overview of the AI/ML process and how ĢƵ tools can be applied to cleansing, deduplication, feature engineering, and dataset labelling for input into ML models. We highlight a practical application of ML through an Entity Resolution use case, which addresses inconsistencies around manual tasks in this process.

The post Read how AI is transforming Data Quality in this exclusive white paper appeared first on ĢƵ.

]]>