1 Introduction
Generative models have been well researched since the introduction of the Variational Autoencoder (VAE) (Kingma and Welling, 2014) and the Generative Adversarial Network (GAN) (Goodfellow et al., 2014; Goodfellow, 2017). Focusing on GANs, much research has targeted improving model architecture and performance, e.g. StyleGAN (Karras et al., 2019) and BigGAN (Brock et al., 2019), or finding solutions to stability issues, e.g. balancing the discriminator and generator and avoiding mode collapse (Metz et al., 2017; Gulrajani et al., 2017; Salimans et al., 2016; Tran et al., 2018, 2019), improving data efficiency (Tran et al., 2021), and the detectability of deep-generated images (Chandrasegaran et al., 2021). However, to our knowledge, little research has gone into addressing biases. This is an important factor to consider as it limits the potential of generative models. For instance, in a GAN application for the facial composition of criminal profiles (Jalan et al., 2020), biases in gender may result in wrongful profiling.

In this work, we take a closer look at the fairness metrics for evaluating deep generative models. A deep generative model produces synthetic data x, where x follows some model distribution p_g imposed by the generator network. In many cases, x can be biased w.r.t. some targeted attribute u.
Here u is a one-hot vector representation of the targeted attribute, and k is the cardinality of the attribute, e.g. k = 2 if u corresponds to gender, or k = 4 if u is a compound attribute corresponding to gender and two different hair colours. In many cases, u is a latent attribute. Therefore, to evaluate the fairness of the generated data w.r.t. u, one would need to determine the attribute value through an attribute classifier C (Grover et al., 2019; Choi et al., 2020; Tan et al., 2020). In particular, for an observed x, the attribute classifier C produces a soft output C(x). Then, some discrepancy measure D between the expected classifier output p̂ = E[C(x)] and the uniform probability vector ū can be used to quantify the fairness of p_g, where ū = [1/k, ..., 1/k]. As per previous works (Choi et al., 2020; Xu et al., 2018; Tan et al., 2020), we utilise a uniform distribution to determine if the model has the ability to achieve statistical parity (Caton and Haas, 2020), i.e. equal probability is given to each outcome. For example, in (Choi et al., 2020), the L2 norm is used to measure the fairness discrepancy (FD) and is given by:

FD(p̂) = ||p̂ - ū||_2    (1)

If FD(p̂) = 0, then p_g is considered to be perfectly fair w.r.t. the targeted attribute u. On the other hand, taking a second look at (1), one may realise that FD is highly dependent on C. Specifically, error in C is inevitable in practice. However, its effect on FD has not been investigated: even with a perfectly fair p_g, we may not obtain FD = 0 due to error in C.
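The FD of (1) can be sketched in a few lines (a minimal illustration with our own function name, not the authors' code):

```python
import numpy as np

def fd_l2(p_hat):
    """L2 fairness discrepancy (Choi et al., 2020): distance between the
    classifier-estimated attribute distribution p_hat and the uniform vector."""
    k = len(p_hat)
    u_bar = np.full(k, 1.0 / k)
    return float(np.linalg.norm(p_hat - u_bar))

# A perfectly fair estimate scores 0; an absolutely biased one scores sqrt(1 - 1/k).
print(fd_l2(np.array([0.5, 0.5])))  # 0.0
print(fd_l2(np.array([1.0, 0.0])))  # ~0.7071
```

Note that even this "perfect" score of 0 presupposes an error-free classifier, which is exactly the assumption the paper questions.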
Related work: A few other limited works have also tried to quantify fairness in GANs. (Xu et al., 2018) measures statistical parity with an additional discriminator. (Tan et al., 2020), on the other hand, utilises a similar method to (Choi et al., 2020), with the addition of determining whether the contextual attributes of the generated images are maintained w.r.t. the population distribution. As a result of Xu's dependency on the model's architecture, we focus on the works of (Choi et al., 2020; Tan et al., 2020), whose use of an auxiliary classifier is deemed advantageous due to its ease of deployment.
To understand how different choices of D may affect the validity of the FD score under errors in C, we conduct an empirical study in this work. The main idea of our study is as follows. Given a deep generative model (biased or fair w.r.t. u) and an attribute classifier C, we examine different choices of discrepancy measure D. We compute the value D(p̂, ū), where p̂ is the attribute distribution estimated via C, and compare it with the ground-truth discrepancy value D(p, ū), where p is the attribute distribution that would be obtained with a perfectly accurate classifier C*. The difference between D(p̂, ū) and D(p, ū) indicates the validity of the selected D under attribute classification errors in C.
Our contributions are:

Identification of the effects of inaccuracies in C on fairness metrics.

A methodological framework to analyse and quantify the differences in fairness metrics for generative models, and a recommendation of a robust metric.
2 Problem Setup and Analysis
2.1 Experiment Setup
In our experiments we utilise ResNet-18 (He et al., 2015) as our C, training it on different k and attribute configurations, with Adam as our optimiser. Suppose that for a dataset, the ground-truth distribution of some attributes is denoted by p. In order to evaluate the effect of inaccuracies in C on a given discrepancy measure D in a controlled setting, we first need to simulate a generator that generates data with an assumed attribute distribution p. To this end, we simply construct a dataset by sampling an available labelled dataset (e.g. CelebA (Liu et al., 2015)). The sampling helps us avoid concerns about the quality or diversity of the generated data. For example, if u corresponds to gender (k = 2) and p = [0.9, 0.1], given that the dataset has 100 samples for each of male and female, we randomly sample 90 males and 10 females. Then we apply the attribute classifier C and calculate the approximated distribution p̂, which can be used to calculate D(p̂, ū).

Lastly, to mimic the perfect classifier C*, we forgo the sampling and directly calculate the FD via p, i.e. D(p, ū), and hence ΔD = |D(p̂, ū) - D(p, ū)|. ΔD thus indicates the deviation of the fairness score as a result of inaccuracies in C.
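The sampling step can be sketched as follows (an illustrative sketch with our own names and a hypothetical labelled pool; the actual image data and classifier step are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_generated_set(labels, p, n):
    """Mimic a generator with attribute distribution p by resampling an
    existing labelled pool (labels are integer attribute ids in [0, k))."""
    counts = np.rint(np.asarray(p) * n).astype(int)   # samples per attribute
    picked = [rng.choice(np.where(labels == a)[0], size=c, replace=True)
              for a, c in enumerate(counts)]
    return np.concatenate(picked)

# Hypothetical pool: 100 "male" (0) and 100 "female" (1) samples.
labels = np.array([0] * 100 + [1] * 100)
idx = simulate_generated_set(labels, [0.9, 0.1], 100)  # 90 males, 10 females
```

The classifier C would then be applied to the sampled items to obtain p̂, while D(p, ū) is computed directly from p without any sampling.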
2.2 Analysis of Previous Works
We emphasise a few key requirements that are necessary for effective FD measurement: 1) u has to be well defined such that a fair representation is achievable; 2) C has to have relatively high accuracy, with low accuracy variability between the various attribute classes; 3) an appropriate metric D has to be utilised such that the FD score is robust against noise, such as inaccuracies in C and the selection of hyperparameters, e.g. k. We focus on criterion 3 in this paper and highlight a few properties that need to be addressed.
We first clarify a few notations and terms.
The extreme points (EP) in the distribution include the ideal worst-case bias scenario, which we call absolutely biased (ABEP), and the ideal fair scenario (FairEP). Using the previous gender example (k = 2), there are two ABEPs, [1, 0] and [0, 1], and the FairEP would be [0.5, 0.5]. Following this, the largest possible fairness discrepancy score achievable with D is denoted by D_max, which is the value of D at each of the ABEPs. Utilising our previous example, D_max = D([1, 0], ū), where ū = [0.5, 0.5].
Note that D_max depends on both the choice of metric and k.
With that, we highlight a few weaknesses of the past works with reference to FD (1) and propose solutions that would mitigate these weaknesses.
a) Scale: The current metrics do not have a consistent upper bound, thereby making experimental comparison difficult.
For example, D_max changes with the attribute size, e.g. for L1, D_max = 0.5 at k = 2 but D_max = 0.375 at k = 4. Hence, we require a metric that has a fixed scale which does not vary with hyperparameter changes.
Solution: As an easy fix, we propose to normalise each of our proposed metrics by its D_max, such that the FD score lies in [0, 1]: the normalised score is D(p̂, ū)/D_max. Note that each metric at a different k has a different normalisation factor D_max, as per Annex A.1 Table 2. This normalisation fixes all metrics' scales to [0, 1], allowing comparison between them. We thus utilise this normalisation technique for all subsequent FD scores, where all D and ΔD can be assumed to have been normalised.
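The normalisation can be sketched as follows (using L1 as the example metric; function names are ours):

```python
import numpy as np

def l1(p, q):
    """Normalised Manhattan distance (raw score, before D_max scaling)."""
    return float(np.abs(np.asarray(p) - np.asarray(q)).sum()) / len(p)

def d_max(metric, k):
    """Largest achievable raw score: metric evaluated at an ABEP vs. uniform."""
    abep = np.eye(k)[0]                      # one-hot, absolutely biased
    return metric(abep, np.full(k, 1.0 / k))

def normalised_fd(metric, p_hat, k):
    """FD score rescaled into [0, 1] by the metric's own D_max."""
    return metric(p_hat, np.full(k, 1.0 / k)) / d_max(metric, k)
```

With this definition, d_max(l1, 2) = 0.5 and d_max(l1, 4) = 0.375, matching the L1 column of Table 2 in Annex A.1, and every ABEP maps to a normalised score of exactly 1.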
b) Inaccuracy in C:
The imperfect C presents a challenging problem where inaccuracies are propagated to the FD scores, making the measure unreliable. We demonstrate this by comparing the normalised FD scores of our proposed metrics, introduced later in Section 3, on four different C of various accuracies and attribute configurations. These were trained on the CelebA dataset with the attribute sets 1) gender; 2) youth; 3) male and black hair; 4) young and smiling.
Pinching effect: In Fig 1 and 2 we observe that a decrease in the accuracy of C results in a "pinching effect" whereby the FD score increases at FairEP and decreases at ABEP. This causes deviation from the theoretical optimal scores of 0 and 1. This is in line with our intuition that a less accurate C tends towards random classification, resulting in p̂ having a more uniform distribution.
Internal variability: Next, there exists internal variability, where different attribute classes have different classification accuracies; e.g. in Fig 1, the first classifier on the left has accuracies of 0.98 and 0.95 for attributes [1, 0] and [0, 1] respectively. These varying accuracies form a bias that propagates to the FD score. Thus distributions of the same shape but on different supports may produce different scores; e.g. in Fig 1, each ABEP measures a different score even though all ABEPs are equally biased. This internal variability worsens as k increases due to the increased difficulty of training C (see Annex A). Hence, we require a metric that is robust to these inaccuracies and whose measurements are close to the ground-truth scores.
3 Proposed Metrics
Generative models cannot utilise the traditional fairness measurements used for classifiers, e.g. Equalised Odds and Equalised Opportunity
(Hardt et al., 2016) and Demographic Parity (Feldman et al., 2015), as a result of their different objectives. Instead, we can evaluate the fairness metric as a problem of measuring the similarity between the probability distributions p̂ and ū, i.e. D(p̂, ū). Similarity refers to how the shapes of the distributions resemble one another. In this section, we explore the various D utilised to describe this similarity. (Charfi et al., 2020) differentiate these similarity measures into two categories, amorphic and morphic. In the following, p̂_i and ū_i denote the probabilities of the i-th outcome in the respective distributions. Amorphic metrics are direct "point-to-point" measurements between distributions, e.g. the Normalised Manhattan Distance (L1), the Normalised Euclidean Distance (L2) and the Wasserstein Distance (WD)/Earth-mover Distance.
L1(p̂, ū) = (1/k) Σ_{i=1}^{k} |p̂_i - ū_i|    (2)

L2(p̂, ū) = (1/k) sqrt( Σ_{i=1}^{k} (p̂_i - ū_i)^2 )    (3)

WD(p̂, ū) = (1/k) Σ_{i=1}^{k} | Σ_{j=1}^{i} (p̂_j - ū_j) |    (4)
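Sketches of the three amorphic measures follow (function names are ours; the exact scaling of WD in the original is an assumption, but any constant factor is absorbed by the D_max normalisation):

```python
import numpy as np

def l1(p, q):
    """(2) Normalised Manhattan distance."""
    return float(np.abs(p - q).sum()) / len(p)

def l2(p, q):
    """(3) Normalised Euclidean distance."""
    return float(np.sqrt(((p - q) ** 2).sum())) / len(p)

def wd(p, q):
    """(4) 1-D Wasserstein / earth-mover distance via the CDF difference,
    scaled by 1/k like the other amorphic metrics (our assumption)."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum()) / len(p)
```

At the k = 2 ABEP [1, 0] versus ū = [0.5, 0.5], l1 gives 0.5 and l2 gives sqrt(0.5)/2 ≈ 0.3536, matching the raw ceilings in Annex A.1 Table 2.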
On the other hand, morphic metrics describe the shape of the distribution by quantifying relevant information in it. For instance, the specificity measurement (5) (Charfi et al., 2020) describes the variability in a distribution. (6) then describes the similarity measure: the difference between the two distributions' variability. However, for simplicity, we refer to this metric as specificity, since Sp(ū) = 0 and hence D_Sp(p̂, ū) reduces to Sp(p̂).
Sp(p) = p_(1) - (1/(k-1)) Σ_{i=2}^{k} p_(i),  where p_(1) ≥ p_(2) ≥ ... ≥ p_(k)    (5)

D_Sp(p̂, ū) = |Sp(p̂) - Sp(ū)|    (6)
A hybrid measure also exists that takes a combination of morphic and amorphic metrics: information specificity (IS) (7) (Charfi et al., 2020) utilises a combination of L1's amorphic point-to-point measurement, to determine the displacement between the two distributions, and specificity, which measures the difference in the distributions' variability. We set α = 0.5 in (7) for our experiments.

IS(p̂, ū) = α · L1(p̂, ū) + (1 - α) · D_Sp(p̂, ū)    (7)
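The morphic and hybrid measures (5)-(7) can be sketched as (function names ours; α = 0.5 as in the experiments):

```python
import numpy as np

def sp(p):
    """(5) Specificity: largest mass minus the mean of the remaining masses.
    Uniform distributions score 0; one-hot (ABEP) distributions score 1."""
    s = np.sort(np.asarray(p, dtype=float))[::-1]   # sort descending
    return float(s[0] - s[1:].mean())

def d_sp(p, q):
    """(6) Morphic discrepancy: difference of the two specificities."""
    return abs(sp(p) - sp(q))

def info_specificity(p, q, alpha=0.5):
    """(7) Hybrid of the amorphic L1 term and the morphic specificity term."""
    l1_term = float(np.abs(np.asarray(p) - np.asarray(q)).sum()) / len(p)
    return alpha * l1_term + (1 - alpha) * d_sp(p, q)
```

Since Sp(ū) = 0, d_sp(p̂, ū) collapses to sp(p̂), which is why the text calls the measure simply "specificity". At the k = 2 ABEP, info_specificity gives 0.5 · 0.5 + 0.5 · 1 = 0.75, matching the IS ceiling in Annex A.1 Table 2.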
We did not consider KL-divergence, as used in (Tan et al., 2020), as it resulted in computational problems when the supports of p̂ and ū differed, e.g. at ABEP. In addition, its non-symmetric property poses other problems beyond the scope of this paper.
3.1 Performance Benchmark
We utilise the following benchmarks to identify the ideal metric that is robust against inaccuracies in C and hyperparameter changes.
Mean extreme point error (MEPE), (8) and (9), measures the metrics' deviations from the theoretical boundaries, 0 and 1 respectively. Note that MEPE_ABEP is an average across multiple ABEPs. S denotes a set of approximated ABEP/FairEP distributions classified by C, where |S| is the number of FairEPs/ABEPs in the set. For example, with k = 2 and 4, when calculating MEPE_ABEP we utilise the ABEPs {[1,0], [0,1]} and {[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]}, hence |S| = 6. We mix the errors across different k in order to find a metric independent of k.
MEPE_FairEP = (1/|S|) Σ_{p̂ ∈ S} | D(p̂, ū) - 0 |    (8)

MEPE_ABEP = (1/|S|) Σ_{p̂ ∈ S} | D(p̂, ū) - 1 |    (9)
Extreme point variability (σ_EP) (10) helps indicate the overall stability of the metrics by measuring the variability of the FD scores at FairEP/ABEP. D̄ denotes the average FD score.

σ_EP = sqrt( (1/|S|) Σ_{p̂ ∈ S} ( D(p̂, ū) - D̄ )^2 )    (10)
Mean error measurement (MEM) (11) determines how far, on average, the approximated score utilising C deviates from the theoretical ideal score. {p̂_i} and {p_i} denote the sets of approximated and ground-truth sample distributions respectively. This metric thus indicates the overall effect that external factors, e.g. the classifier's accuracy, have on the FD score.

MEM = (1/N) Σ_{i=1}^{N} | D(p̂_i, ū) - D(p_i, ū) |    (11)
Note that (8), (9) and (10) are calculated across k = {2, 4, 8, 16} in our experiments.
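The three benchmarks (8)-(11) reduce to short aggregation functions over normalised FD scores (a sketch; names are ours):

```python
import numpy as np

def mepe(scores, boundary):
    """(8)/(9) Mean extreme-point error: average deviation of normalised FD
    scores from the theoretical boundary (0 at FairEP, 1 at ABEP)."""
    return float(np.mean(np.abs(np.asarray(scores) - boundary)))

def epv(scores):
    """(10) Extreme-point variability: standard deviation of the FD scores."""
    return float(np.std(scores))

def mem(approx_scores, true_scores):
    """(11) Mean error measurement: average gap between scores computed with
    the real classifier C and with the perfect classifier C*."""
    a, t = np.asarray(approx_scores), np.asarray(true_scores)
    return float(np.mean(np.abs(a - t)))
```

A metric that is robust to classifier error should have MEPE and EPV near 0 at both extreme points, and a low MEM across the sweep.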
4 Experiments
4.1 Experiment Setup
Next, we conduct the experimental evaluation of the various metrics.
We again utilise CelebA to train the next set of classifiers C for the remaining experiments. The attributes {gender, black hair, smiling, bangs} are incrementally combined to train four different C of increasing k. Without loss of generality, the attributes are binary, so k scales exponentially with the number of attributes combined.
In 4.2 and 4.3, we measure the fairness metrics' response to varying k = {2, 4, 8, 16}. We first analyse the fairness metrics at their EPs, followed by varying the distribution from ABEP to FairEP, generated with Algorithm 1.
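Algorithm 1 itself is not reproduced in this excerpt; a plausible sketch, assuming a simple linear interpolation schedule (our assumption), is:

```python
import numpy as np

def fairness_sweep(k, epochs):
    """Interpolate from an absolutely biased one-hot distribution (ABEP)
    to the uniform distribution (FairEP) over a number of fairness epochs."""
    abep = np.eye(k)[0]
    fair = np.full(k, 1.0 / k)
    return [(1 - t) * abep + t * fair for t in np.linspace(0.0, 1.0, epochs)]

sweep = fairness_sweep(4, 5)  # starts at [1,0,0,0], ends at [0.25]*4
```

Each intermediate step is a valid probability vector, so the sweep can be fed through the sampling procedure of Section 2.1 to measure every metric along the ABEP-to-FairEP path.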
Table 1: Benchmark results for each metric.

Benchmark | Point | L2 | L1 | IS | Specificity | WD
--- | --- | --- | --- | --- | --- | ---
Mean Extreme Point Error | FairEP | 0.0588 | 0.0851 | 0.0361 | 0.0276 | 0.0851
Mean Extreme Point Error | ABEP | 0.2106 | 0.1884 | 0.2273 | 0.2338 | 0.1884
Extreme Point Variability | FairEP | 0.0349 | 0.0657 | 0.0207 | 0.0168 | 0.0657
Extreme Point Variability | ABEP | 0.1058 | 0.0698 | 0.1149 | 0.1212 | 0.0698
Mean Error Measurement | 2-ATTR Sweep | 0.0184 | 0.0184 | 0.0184 | 0.0184 | 0.0184
Mean Error Measurement | 4-ATTR Sweep | 0.0884 | 0.0623 | 0.0919 | 0.1034 | 0.0623
Mean Error Measurement | 8-ATTR Sweep | 0.1386 | 0.0798 | 0.1535 | 0.1727 | 0.0798
Mean Error Measurement | 16-ATTR Sweep | 0.1761 | 0.1155 | 0.19 | 0.2031 | 0.1155
4.2 Extreme Points Analysis
FairEP vs ABEP: Overall, we observe from Table 1 that FairEP scores are closer to the theoretical value, i.e. lower error, than the ABEP scores. This occurs due to trend-line effects, which we discuss in 4.3, as well as the internal variability discussed in 2.2, where a particular attribute class may have poorer accuracy and hence a larger error in its ABEP score (e.g. Fig 1). However, when measuring FairEP, any poor estimation by C is averaged across all permutations of the attribute, because a uniform distribution is being sampled. This contributes to the lower error measurement at FairEP and hence results in a better estimation of the FairEP, with lower MEPE in comparison to ABEP, as per Table 1. We call this the internal variability effect (see Annex D).

ABEP analysis: We use quantitative analysis to study the metrics' behaviour, the FD score and its variability, at ABEP, as seen in Fig 5 and 6 respectively. Specificity performs the worst with the highest MEPE, regardless of k. It also has the highest σ_EP, making it the least consistent fairness metric at ABEP. Conversely, WD and L1 perform the best with the lowest MEPE and σ_EP. As the k used is relatively small, WD and L1 are generally identical to one another when the scores are rounded to 4 d.p.
We identify the advantages of the L1 metric in comparison to L2, specificity and, by extension, IS. L1's lower variability is attributed to the simplicity of the metric's linear scale, making it less susceptible to noise from volatility in C. Specificity, on the other hand, measures variability in the distribution and is therefore more sensitive to the noise prevalent at ABEP. This influences the larger variance in its score as well as its deviation from the theoretical score of 1. This volatility is further amplified by the decrease in C's accuracy as k increases. To validate this, Annex A Fig 11 shows specificity having the greatest increase in error as k increases.

FairEP analysis: Utilising Fig 7 and 8, the metrics' performance at FairEP is the converse of ABEP: L1/WD perform the worst and specificity the best. Amorphic metrics have smaller D_max. As such, their fairness scores are largely scaled up during normalisation, resulting in L1, L2 and WD having relatively larger fairness scores and hence larger MEPE. This scaling effect is amplified with increasing k; see Annex A.1 Table 2 for all the D_max values. Specificity, on the other hand, has D_max = 1 regardless of the size of k; hence no scaling occurs, thereby attaining the lowest MEPE. Furthermore, as per the internal variability effect, average accuracy is generally higher at the FairEP, making specificity more stable and thereby attaining the lowest σ_EP.
4.3 Metric Trend line Analysis
The trend line analysis studies the behaviour of the metrics with an increasingly fair distribution: each increment in the fairness epoch tends towards a more uniform distribution. Fig 3 shows that the metrics' general trends follow our expectation, where the highest score is seen at the beginning and decreases with each epoch. Overall, L1/WD performed the best with the lowest MEM score and specificity the worst. A few interesting observations are seen in Fig 4: 1) all metrics begin with a large error that decreases with each fairness epoch, and 2) L1 shows a momentary increment in fairness score at epochs 38 to 52.

General reduction in error: The initial large error seen in Fig 4 at epoch 1 is the result of inaccuracy in C. This is similar to the observation discussed in the ABEP analysis of 4.2. Additionally, we observe a decrease in the trend-line gradient, which creates deviation from the ideal trend, seen in Annex C Fig 14. The change in gradient then results in gradual convergence towards the ideal scores with each increment in the fairness epoch, i.e. a reduction in error. The same trend holds for k = 4 and 16, seen in Annex C, where the former experiences less deviation as a result of its more accurate C, and the converse for the latter. Furthermore, we note that when starting at different ABEPs the convergence rates are different: the ABEP whose attribute class has the lowest accuracy converges the slowest (see Annex D). Lastly, we notice that not all metrics converge at the same rate; we address this in the following.

L1 trend abnormality: The increment in L1's fairness score is the result of the metric's ability to aggressively correct its error. We attribute this to L1's simple linear equation, which is least susceptible to noise caused by inaccuracies in C and hence the quickest to correct. This inference is supported by the trend that specificity, the metric most sensitive to poor accuracy in C, has the largest error and slowest convergence, followed by IS, which is influenced by specificity, and finally L2, whose quadratic calculation creates larger variations from the true measurement, as seen previously in 4.2. L1's aggressive correction makes it the most robust metric to poor accuracy in C. These trends are similarly observed for k = 4 and 16, as per Annex C.
5 Future works, Conclusion and Metric Recommendation
Future works: our work defines fairness as a uniform distribution, with the motivation of data augmentation, where an even representation of, e.g., hair colour is ideal for balanced learning. However, this definition may vary: fairness could instead mean "to follow the population distribution", e.g. the racial demographic of a country, which may not be uniform. Future work could explore this domain.

To conclude, we have presented the existing problems in the current FD score and introduced a variety of morphic, amorphic and hybrid fairness metrics to help mitigate them. We introduced performance benchmarks to determine the ideal metric. Through experimentation, we observed that the L1/WD metrics performed the best at ABEP and specificity at FairEP, as per Table 1. Furthermore, L1/WD demonstrated the most robustness, having the lowest MEM. As such, there is no single optimal metric that favours all components of our benchmark. Thus, we recommend the hybrid metric, IS, as the ideal middle ground. We also point out that it is unlikely that a generated distribution will tend towards ABEP; hence, emphasis should be placed on FairEP, where IS performs well.
Acknowledgements
We would like to thank Milad Abdollahzadeh for his constructive comments on the manuscript.
References
Brock et al. (2019). Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv:1809.11096.
Caton and Haas (2020). Fairness in Machine Learning: A Survey. arXiv:2010.04053.
Chandrasegaran et al. (2021). A Closer Look at Fourier Spectrum Discrepancies for CNN-Generated Images Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7200–7209.
Charfi et al. (2020). Possibilistic Similarity Measures for Data Science and Machine Learning Applications. https://hal.archives-ouvertes.fr/hal-02890097.
Choi et al. (2020). Fair Generative Modeling via Weak Supervision. arXiv:1910.12008.
Feldman et al. (2015). Certifying and Removing Disparate Impact. arXiv:1412.3756.
Goodfellow et al. (2014). Generative Adversarial Nets. In Advances in Neural Information Processing Systems, Vol. 27, pp. 2672–2680.
Goodfellow (2017). NIPS 2016 Tutorial: Generative Adversarial Networks. arXiv:1701.00160.
Grover et al. (2019). Bias Correction of Learned Generative Models Using Likelihood-Free Importance Weighting. arXiv:1906.09531.
Gulrajani et al. (2017). Improved Training of Wasserstein GANs. arXiv:1704.00028.
Hardt et al. (2016). Equality of Opportunity in Supervised Learning. arXiv:1610.02413.
He et al. (2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385.
Jalan et al. (2020). Suspect Face Generation. In 2020 3rd International Conference on Communication System, Computing and IT Applications (CSCITA), pp. 73–78.
Karras et al. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv:1812.04948.
Kingma and Welling (2014). Auto-Encoding Variational Bayes. arXiv:1312.6114.
Liu et al. (2015). Deep Learning Face Attributes in the Wild. arXiv:1411.7766.
Metz et al. (2017). Unrolled Generative Adversarial Networks. arXiv:1611.02163.
Salimans et al. (2016). Improved Techniques for Training GANs. arXiv:1606.03498.
Tan et al. (2020). Improving the Fairness of Deep Generative Models without Retraining. arXiv:2012.04842.
Tran et al. (2018). Dist-GAN: An Improved GAN Using Distance Constraints. In ECCV.
Tran et al. (2019). Self-Supervised GAN: Analysis and Improvement with Multi-Class Minimax Game. In Advances in Neural Information Processing Systems, Vol. 32.
Tran et al. (2021). On Data Augmentation for GAN Training. IEEE Transactions on Image Processing 30, pp. 1882–1897.
Xu et al. (2018). FairGAN: Fairness-Aware Generative Adversarial Networks. arXiv:1805.11202.
Appendix A Metric Study
The following graph shows the variability of the accuracy between ABEPs as the dimension k increases. As previously discussed, the variability begins to increase, in addition to the reduction in accuracy, as k increases. This indicates that C has an internal bias towards certain attributes.
With this internal bias, in Fig 10 and 11 we further observe that the extreme point errors increase with k. Additionally, we begin to see a clear distinction between the metrics. At FairEP, specificity has the least error regardless of k, whereas WD begins diverging with increased error. However, the converse is seen for ABEP.
A.1 Normalisation Factor
The normalisation factors are in Table 2, where the value in each cell represents the theoretical ceiling (D_max) of each metric. The normalisation factor is an important addition to ensure that the fairness scores lie in [0, 1], thereby allowing the metrics to be compared directly with one another. Without it, the metric would have little meaning, as the theoretical ceiling changes with the attribute size; this implies that the fairness score could be synthetically improved simply by increasing k.
Nonetheless, we are aware that the normalisation factor does create some problems. For example, at FairEP the raw L1 score can be lower than the raw specificity score, yet after normalisation L1 becomes greater than specificity as a result of its much smaller normalisation factor.
On the other hand, when the difference in normalisation factors is small, such as between L2 and L1, we observe less significant effects: L1 remains larger than L2 regardless of L2's smaller normalisation factor.
Table 2: Normalisation factors (D_max) for each metric and attribute size k.

k | L2 | L1 | IS | Specificity | WD
--- | --- | --- | --- | --- | ---
2 | 0.353553391 | 0.5 | 0.75 | 1 | 0.5
4 | 0.216506351 | 0.375 | 0.6875 | 1 | 0.375
8 | 0.116926793 | 0.21875 | 0.609375 | 1 | 0.21875
16 | 0.060515365 | 0.1171875 | 0.55859375 | 1 | 0.1171875
Appendix B Classifier’s attributes
Attributes  dimension of  Accuracy 

Gender  2  0.98 
Youth  2  0.81 
Male,black hair  4  0.83 
young, Smiling  4  0.72 
Attributes  dimension of  Accuracy 

Gender  2  0.98 
Gender, blackhair  4  0.86 
Gender,black hair, Smiling  8  0.78 
Gender, black hair, Smiling, bangs  16  0.66 
Appendix C Trendline Sweep
Appendix D Internal variability
As mentioned in the trend-line sweep, we observe that the metrics generally trend downwards with increasing fairness in the distribution. We ran the trend-line sweep starting from each of the respective ABEPs, e.g. [1,0,0,0], [0,1,0,0], [0,0,1,0] and [0,0,0,1], and measured the standard deviation. As per Figures 21 and 27, we observe that as the fairness epoch increases there is a decrease in the variability of the metrics, which we deem the internal-variability effect. This observation is in addition to the general decrease in the fairness score for all metrics. Fig 22, 23, 24 and 25 show the individual metrics and their normalised scores at each fairness epoch.
Fig 28 further demonstrates the internal-variability effect by showing that the error converges at a lower point for the ABEP with the higher accuracy (0.87) than for the one with the lower accuracy (0.65). Furthermore, when looking at the L1 metric for the two extreme accuracies in Fig 28 and 29, we see the different impacts that the internal-variability effect has on the various extreme points: the ABEP with the worst accuracy (in orange) observes a steep improvement in its error, whereas the one with the highest accuracy observes a slower, gradual improvement.