The EU Ethics Guidelines for Trustworthy Artificial Intelligence emphasize that any AI system should be lawful, ethical, and robust, and that each of these components is necessary but not sufficient on its own for trustworthy AI. Failing to challenge your AI system on these three fronts can have unexpected consequences for your users and your company, and not just through possible bad media exposure, but through actual performance setbacks. It is about ethics, but it is also about competitiveness. At PickleTech we provide algorithm audits as a service, and with such services available, the responsibility to challenge AI systems can no longer be evaded. Here we write about the key elements we focus on when auditing AI systems.
Do you develop, deploy, or simply use an AI system in your organization? If so, what are the expected performance AND uncertainty of the AI system? Which data was used to train it? What has the actual performance of the AI system been on real data over the last year? How does the AI system balance errors? Which mechanisms do you have in place to address them? Can you explain how the AI system goes from input features to system outputs? Which features are most important and how do they affect the AI system's decisions? Can you explain these decisions for each data instance individually?
There are more questions to be answered, involving other aspects such as privacy and fairness, and we discuss some of them throughout the post. But if you cannot easily find an answer to any of the questions above, chances are high that your AI system is performing in unpredictable ways, and thus impacting your operations negatively. This can become a real competitive and economic burden.
Several stakeholders and responsibilities are relevant for trustworthy AI systems, including but not limited to developers, organizations operating the system -product owners-, end users, anyone affected by the system, and sometimes even society as a whole. Here we focus on the perspective of an organization, private or public, that uses self-developed or external AI systems for its operations. Algorithm audit can contribute to the trustworthiness of the AI technology, improving its impact on the organization's competitiveness.
This post is not meant to be exhaustive, as we believe there are very good resources out there for that. Beyond the EU guidelines themselves, see for instance the white paper from the Supreme Audit Institutions of Finland, Germany, the Netherlands, Norway and the UK. Instead, we focus here on highlighting audit aspects that should not be missed and that require deeper technical data science knowledge, including the ability to build experiments and test the AI models statistically.
At PickleTech we believe a big part of the skill set needed for algorithm audit is strongly aligned with our experience in science. Working in research and development involves skepticism and thoroughly reviewing publications in peer-reviewed journals, as developers, authors, referees, and field experts; plus continued work with experimentation and hypothesis testing.

Model and Performance
For every aspect of the guidelines for trustworthy AI, any audit process requires considering the AI system and its context together, in order to adapt the guidelines to the specific case. The audit may involve actions of varying depth, from simply reviewing the available AI system documentation to actual tests involving the AI system in operation. Keeping this in mind, we start with a key aspect of the data science model development and its operation: validation.
Internal Validation
The development of any AI system involving Machine Learning -and of course this applies to Deep Learning too- includes a process of model selection, model training, and model validation. This validation may be referred to as internal validation, and it includes common methodologies like cross validation, data splits, and hold-out tests.
In our work at PickleTech, we often rely on nested cross-validation frameworks because they are very powerful for combining model selection and hyper-parameter optimization while still providing fair estimates of generalizable performance metrics. That is, picking the best possible model without data contamination while, at the same time, computing estimates and uncertainties of how the model is going to perform out of sample, when it has to work on new data.
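As an illustration, below is a minimal sketch of a nested cross-validation setup with scikit-learn; the dataset, model, and parameter grid are illustrative assumptions, not our actual pipeline. The inner loop selects hyper-parameters, while the outer loop estimates out-of-sample performance on data never used for that selection.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: model selection and hyper-parameter optimization.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"n_estimators": [100, 300], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=inner_cv, scoring="roc_auc")

# Outer loop: estimate (and spread of) out-of-sample performance
# on folds never seen by the model selection step.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across outer folds is what gives you an uncertainty on the expected performance, rather than a single optimistic number.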
This is an essential step for statistical learning, linked to the system's ability to learn from data. Worryingly, it is all too often overlooked and poorly assessed. If your AI system was developed without a proper validation framework, you may have a very wrong expectation and uncertainty for its performance, and chances are that data contamination, bias, and overfitting are problems you will soon have to face.
Ideally, internal validation should be properly designed and integrated into the AI system development, with the process fully documented. Otherwise, if the AI system has been running in production for a long time, it may already have been jeopardizing your operations. In that case, a proper algorithm audit can become a life saver to pinpoint required fixes and improve the system's performance.
Model Choices
When checking the internal validation process, one aspect of algorithm audit we focus on is reviewing the model selection and feature engineering choices. When a model is too complicated for a task, lacking a clear motivation; when too many features are ingested into the AI system, but only very few are ultimately consumed; or in general, when choices make the system overly complicated without obvious gains; these are all common indications of a potentially untrustworthy AI system.
A good practice we follow at PickleTech when developing our own AI systems is setting in advance which simpler models -sometimes not even ML based- are to be used as benchmarks. These benchmarks, together with pre-defined improvement criteria, are fundamental to assessing when a model selection is actually justified. It may seem this is always obvious, but when you become too invested in a project as it moves forward, it is all too easy to convince yourself that a very marginal gain from a fancy and overly complicated system is a reasonable direction to take. Most times, that is unfortunately a bad modeling choice.
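For illustration, the sketch below compares a candidate model against simple benchmarks fixed in advance, under a pre-defined improvement criterion; the models, the metric, and the 5% threshold are illustrative assumptions rather than general recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
MIN_RELATIVE_GAIN = 0.05  # pre-defined gain required to justify extra complexity

# Simple benchmarks fixed in advance, including a non-ML baseline.
benchmarks = {
    "majority_class": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}
candidate = GradientBoostingClassifier(random_state=0)

best_benchmark = max(
    cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for model in benchmarks.values()
)
candidate_score = cross_val_score(candidate, X, y, cv=5, scoring="roc_auc").mean()

justified = candidate_score >= best_benchmark * (1 + MIN_RELATIVE_GAIN)
print(f"benchmark={best_benchmark:.3f}, candidate={candidate_score:.3f}, "
      f"justified={justified}")
```

Fixing the benchmarks and the threshold before seeing the candidate's results is what keeps the comparison honest.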
External Validation and Adaptation
Even when internal validation is properly performed, model development is usually carried out in staged environments. There, data, model operation, and feedback mechanisms are not dynamically connected with the final -real world- implementation of the system, particularly if the development is complex and extends over time.
In order to further validate the performance of any AI system, we need an actual experimental test within a real world context, sometimes referred to as a field test. In external validation, the AI system operates with real world data, the process may involve actual system users, and it includes all steps in the AI system pipeline, not just a subset of the solution such as the ML model in isolation. We must consider all actions leading to the final AI system decision and its integration within the context of the operations.
Another important aspect we assess when auditing is the human oversight component in the AI system; after all, respect for human autonomy is another principle of trustworthy AI. At the technical level, predictability and stability of the AI system are two further aspects external validation addresses, as these are fundamental requirements for the technical robustness and safety of the system.
The purpose of external validation is to obtain an accurate assessment of the AI system's performance in real conditions, and of its impact on your operations when fully embedded in them. It is also key to detecting pitfalls and necessary model adaptations to adjust its performance to the required purpose. When external validation has not been part of the model development, we include it when auditing the AI system.
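As a rough illustration, the sketch below checks live (field) performance against the internally validated expectation and its uncertainty; the metric, the two-sigma threshold, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def field_performance_check(y_true_live, y_score_live,
                            expected_auc, expected_std, n_sigma=2.0):
    """Flag the system when live ROC AUC falls below the internally
    validated expectation by more than n_sigma standard deviations."""
    live_auc = roc_auc_score(y_true_live, y_score_live)
    degraded = live_auc < expected_auc - n_sigma * expected_std
    return live_auc, degraded

# Toy usage with synthetic "live" labels and scores.
rng = np.random.default_rng(0)
y_live = rng.binomial(1, 0.3, size=500)
scores = np.clip(0.2 * y_live + rng.normal(0.5, 0.2, size=500), 0.0, 1.0)
live_auc, degraded = field_performance_check(y_live, scores,
                                             expected_auc=0.85, expected_std=0.03)
print(f"live ROC AUC = {live_auc:.3f}, degraded = {degraded}")
```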

Data Considerations
Distinguishing between modeling and data aspects in the search for trustworthy AI can often be arbitrary. As it stems from the points above, several checks on the data sampling procedure and other data integrity aspects are strongly related to model validation actions. An accurate understanding of the data pipeline is in any case essential to trustworthy AI system development. We discussed in a previous blog entry how data quality issues are behind patterns that recurrently appear when assessing AI solutions that fail to meet their expectations, for instance in the context of covid-related AI systems.
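As an example of the kind of checks involved, below is a minimal sketch of basic data-integrity tests an audit might run on the training and test sets; the specific checks, the thresholds, and the toy data are illustrative assumptions.

```python
import pandas as pd

def audit_data(train: pd.DataFrame, test: pd.DataFrame, max_missing: float = 0.2):
    """A few basic integrity checks on the train/test data."""
    report = {}
    # Identical rows shared between train and test suggest contamination.
    report["train_test_overlap_rows"] = len(pd.merge(train, test, how="inner"))
    # Features with too many missing values are a data-quality red flag.
    missing = train.isna().mean()
    report["high_missing_features"] = missing[missing > max_missing].index.tolist()
    # Constant features add complexity without carrying information.
    report["constant_features"] = [c for c in train.columns if train[c].nunique() <= 1]
    return report

# Toy usage with a deliberately leaked row and a partially missing feature.
train = pd.DataFrame({"a": [1, 2, 3], "b": [None, 5.0, 6.0]})
test = pd.DataFrame({"a": [3, 4], "b": [6.0, 7.0]})
print(audit_data(train, test))
```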
There are also many data-related issues to assess when it comes to privacy and data ownership. These are both ethical and legal considerations. Particularly for personal data, there are actual regulations established by legal frameworks such as the General Data Protection Regulation (GDPR) in Europe. When auditing, we take into account relevant considerations for the AI system around purpose limitation, data minimization, proportionality and transparency. Have a look at the white paper here, or directly at the General Data Protection Regulation text for details.
Transparency and Explainability
Depending on the AI system, transparency and explainability are not just good practice, but necessary to fulfill regulations. For instance, public administrations usually need to meet transparency requirements and provide evidence when decisions involve individuals, as per the current legal regulations.
In other cases, you may try to evade a proper assessment against the guidelines for trustworthy AI systems by playing the IP and confidentiality card. The truth is, if you are running an AI system and your internal documentation is not even transparent enough for you to answer the questions above, you have a transparency problem. And that should be warning enough that the integration of the AI solution may cause legal, ethical, and competitive issues.
When it comes to explainability, there are several methods to approach it and to reduce or avoid the annoying and very dangerous notion of an AI system that behaves like a black box. These methods target two levels of explainability.
The first is global explainability, where we assess the relation between data features and AI system predictions at a statistical level, involving a large enough sample of data. This includes methods like partial dependence plots, permutation importances, or tree-based importances, just to mention commonly known ones.
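As a brief illustration, the sketch below computes permutation importances with scikit-learn to get a global view of which features the model relies on; the dataset and model are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in performance;
# the larger the drop, the more the model relies on that feature globally.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranking = sorted(zip(X.columns, result.importances_mean, result.importances_std),
                 key=lambda item: -item[1])
for name, mean, std in ranking[:5]:
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```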
The second type of explainability focuses on understanding why the AI system makes a prediction for a single specific data instance; this is the notion of local explainability. Methods for this include, for instance, the now extensively used Shapley additive explanations (SHAP) and other game theory inspired methods. As mentioned above, this second level of explainability is sometimes required by law.
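For illustration, the sketch below uses the shap package to explain one single prediction of a tree-based regressor; the synthetic data and the model are illustrative assumptions, not a specific audited system.

```python
import pandas as pd
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data stands in for the audited system's data.
X_raw, y = make_regression(n_samples=500, n_features=6, random_state=0)
X = pd.DataFrame(X_raw, columns=[f"feature_{i}" for i in range(6)])
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[[0]])  # explain one single instance

# Signed contribution of each feature to this one prediction,
# relative to the explainer's expected (average) output.
for name, value in sorted(zip(X.columns, shap_values[0]),
                          key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.3f}")
```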
In all cases, we advise that algorithm explainability be integrated while developing the AI system. When that is not the case, algorithm audit raises this issue and helps design techniques to assess the explainability of your model.
Equal Treatment and Fairness
In this post, we have talked interchangeably about ethical principles and values for trustworthy AI, like human autonomy; about the derived requirements, like technical robustness, transparency or privacy; and also about good practices and methods to develop and audit AI systems. We strongly suggest you read the EU guidelines for a global overview of the principles, requirements and methods. We cannot end without writing about another of the ethical principles for trustworthy AI: fairness.
It may seem easy to transform human values like fairness into actual model behaviors during development, but it turns out there are quite some challenges ahead. This is for instance the case when training on data sets that are strongly imbalanced and/or contain bias. There is then no unique way to implement the fairness criteria in your ML algorithm across the majority and minority classes. Technically, you could decide for instance to impose group fairness focusing on recall, but you could also choose precision for it, or even look at prevalence. When the classes are imbalanced, these all lead to distinct notions of the equality and impact of the model. The audit process makes sure these notions are confronted, so that at least you consciously pick one.
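As an illustration, the sketch below computes per-group recall, precision, and predicted prevalence, three different notions of group fairness that can disagree on imbalanced data; the group labels and the synthetic data are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def group_fairness_report(y_true, y_pred, group):
    """Per-group recall, precision and predicted prevalence; large gaps
    between groups flag unequal treatment under the corresponding notion."""
    df = pd.DataFrame({"y_true": y_true, "y_pred": y_pred, "group": group})
    rows = []
    for g, sub in df.groupby("group"):
        rows.append({
            "group": g,
            "recall": recall_score(sub.y_true, sub.y_pred, zero_division=0),
            "precision": precision_score(sub.y_true, sub.y_pred, zero_division=0),
            "predicted_prevalence": sub.y_pred.mean(),
        })
    return pd.DataFrame(rows)

# Toy usage: the same predictions can look fair on one metric but not another.
rng = np.random.default_rng(0)
group = rng.choice(["A", "B"], size=1000, p=[0.8, 0.2])
y_true = rng.binomial(1, np.where(group == "A", 0.3, 0.1))
y_pred = rng.binomial(1, np.where(group == "A", 0.25, 0.15))
print(group_fairness_report(y_true, y_pred, group))
```

Equalizing one of these columns across groups generally leaves the others unequal, which is exactly why the choice has to be made explicitly.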
Final words
Algorithm audit contributes to making your AI system lawful, ethical and robust; the three components of trustworthy AI. Many of the checks, methods, and practices we have described here should be integrated while developing your AI system, and continuously assessed during deployment. This is an ethical concern, but it is also about competitiveness. It applies to both in-house and externally developed solutions. We strongly encourage you to challenge the AI systems you operate as a necessary practice to ensure trustworthy AI. At PickleTech we offer algorithm audits as a service; we can help you with that.
*Header image source: Pixar
[1] EU’s ethics guidelines for Trustworthy Artificial Intelligence (EN version): https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=60419
[2] Auditing machine learning algorithms, white paper from the Supreme Audit Institutions of Finland, Germany, the Netherlands, Norway and the UK: Auditing machine learning algorithms (auditingalgorithms.net)
[3] General Data Protection Regulation, https://eur-lex.europa.eu/eli/reg/2016/679/oj
[4] Artificial Intelligence, ethics and Society, OEIAC (EN version): Informe_OEIAC_2021_eng-3.pdf (udg.edu)