Network reconstruction based on synthetic data generated by a Monte Carlo approach

Masiar Novine; Cecilie Cordua Mattsson; Detlef Groth

doi:10.52905/hbph2021.3.26

Authors

Masiar Novine, University of Potsdam, Institute of Biochemistry and Biology, Bioinformatics Group, 14469 Potsdam, Germany, https://orcid.org/0000-0001-9687-8675
Cecilie Cordua Mattsson, ADBOU, Institute of Forensic Medicine, University of Southern Denmark, Campusvej 55, DK 5230 Odense M, Denmark, https://orcid.org/0000-0002-9110-6550
Detlef Groth, University of Potsdam, Institute of Biochemistry and Biology, Bioinformatics Group, 14469 Potsdam, Germany, https://orcid.org/0000-0002-9441-3978

Keywords:

Monte Carlo method, network reconstruction, mcgraph, random sampling, linear enamel hypoplasia

Abstract

Background: Network models are useful tools for simplifying and understanding the systems under investigation. Yet it is often unclear how to assess the methods used for network construction. Random resampling simulations can help assess these methods, provided that synthetic data suitable for reliable network reconstruction exist.

Objectives: We implemented a new Monte Carlo algorithm to create simulated data for network reconstruction, tested the influence of parameter adjustments, and used simulations to select a method for network model estimation on real-world data. We hypothesized that reconstructions based on Monte Carlo data score at least as well as those based on benchmark data.
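To illustrate the general idea of generating synthetic data from a known graph, the following is a minimal Python sketch. It is not the mcgraph algorithm: the update rule, the function name `simulate_from_graph`, and all parameters (`n_iter`, `noise`, the 0.5 mixing factor) are assumptions chosen only to show how random edge-wise updates can make connected nodes correlate.

```python
import math
import random


def simulate_from_graph(adj, n_samples=200, n_iter=30, noise=0.1, seed=0):
    """Hypothetical Monte Carlo sampler (an assumption, not the mcgraph method).

    adj[i][j] != 0 means an edge from node i to node j with that weight.
    Returns n_samples rows, each holding one value per node.
    """
    rng = random.Random(seed)
    n_nodes = len(adj)
    # Collect (source, target, weight) triples from the adjacency matrix.
    edges = [(i, j, adj[i][j])
             for i in range(n_nodes) for j in range(n_nodes) if adj[i][j] != 0]
    data = []
    for _ in range(n_samples):
        # Every node starts from independent standard-normal noise.
        values = [rng.gauss(0.0, 1.0) for _ in range(n_nodes)]
        if edges:
            for _ in range(n_iter):
                # Pick a random edge and pull the target toward the source.
                src, tgt, w = rng.choice(edges)
                values[tgt] = (0.5 * values[tgt] + 0.5 * w * values[src]
                               + rng.gauss(0.0, noise))
        data.append(values)
    return data


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy input graph: node 0 -> node 1, node 2 is isolated.
toy_adj = [[0, 1, 0],
           [0, 0, 0],
           [0, 0, 0]]
toy_data = simulate_from_graph(toy_adj, n_samples=500, seed=1)
col = lambda k: [row[k] for row in toy_data]
print("r(0,1) =", round(pearson(col(0), col(1)), 2))
print("r(0,2) =", round(pearson(col(0), col(2)), 2))
```

In this toy setting the connected pair should correlate strongly while the isolated node should not, which is the property a reconstruction method would later try to recover.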

Methods: Simulated data were generated in R using the Monte Carlo algorithm of the mcgraph package. Benchmark data were created with the huge package. Networks were reconstructed using six estimator functions and scored with four classification metrics. Mean score differences were tested with Welch’s t-test. Network model estimation on real-world data was done by stepwise selection.
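Scoring a reconstructed network with classification metrics can be sketched as follows, treating every node pair as a binary "edge / no edge" prediction. The abstract does not name the four metrics, so precision, recall, F1 and the Matthews correlation coefficient (MCC) are assumed here as common choices; the function names and the toy graphs are hypothetical.

```python
import math


def edge_confusion(true_adj, est_adj):
    """Count TP, FP, FN, TN over all unordered node pairs (undirected view)."""
    n = len(true_adj)
    tp = fp = fn = tn = 0
    for i in range(n):
        for j in range(i + 1, n):
            is_edge = true_adj[i][j] != 0
            predicted = est_adj[i][j] != 0
            if is_edge and predicted:
                tp += 1
            elif predicted:
                fp += 1
            elif is_edge:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def classification_scores(tp, fp, fn, tn):
    """Precision, recall, F1 and MCC; undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Toy example: the true graph is the chain 0-1-2-3; the estimate misses
# edge 2-3 and adds a spurious edge 0-3.
true_adj = [[0, 1, 0, 0],
            [1, 0, 1, 0],
            [0, 1, 0, 1],
            [0, 0, 1, 0]]
est_adj = [[0, 1, 0, 1],
           [1, 0, 1, 0],
           [0, 1, 0, 0],
           [1, 0, 0, 0]]
print(classification_scores(*edge_confusion(true_adj, est_adj)))
```

MCC is included because it uses all four confusion counts, which matters for sparse graphs where true negatives dominate.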

Samples: Simulated data were generated from 640 input graphs of various types and sizes. The real-world dataset consisted of 67 medieval skeletons of females and males from the regions of Refshale (Lolland) and Nordby (Jutland) in Denmark.

Results: The t-tests and 95% confidence intervals (CI95%) show that evaluation scores for network reconstructions based on data from the mcgraph package were at least as good as those based on the huge benchmark. On average, the scores for the mcgraph package were even slightly better.
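The comparison of mean scores with Welch's t-test can be sketched in a few lines. The score values below are made up for illustration and are not results from the study. The t statistic and the Welch-Satterthwaite degrees of freedom follow the standard formulas; the 95% interval uses a normal quantile as a large-sample approximation, which is an assumption of this sketch (an exact interval would use the t quantile at the computed degrees of freedom).

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch's t statistic, Welch-Satterthwaite df, and the standard error."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    se = math.sqrt(va + vb)
    t = (mean(a) - mean(b)) / se
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, se


def approx_ci95(a, b):
    """Approximate 95% CI for mean(a) - mean(b) using a normal quantile."""
    _, _, se = welch_t(a, b)
    z = NormalDist().inv_cdf(0.975)
    diff = mean(a) - mean(b)
    return diff - z * se, diff + z * se


# Hypothetical evaluation scores for two data sources (illustration only).
scores_mc = [0.82, 0.79, 0.85, 0.80, 0.83, 0.81]
scores_benchmark = [0.78, 0.80, 0.77, 0.81, 0.76, 0.79]

t, df, _ = welch_t(scores_mc, scores_benchmark)
low, high = approx_ci95(scores_mc, scores_benchmark)
print(f"t = {t:.2f}, df = {df:.1f}, approx. CI95% = [{low:.3f}, {high:.3f}]")
```

Welch's variant is used rather than Student's t-test because it does not assume that both groups of scores have equal variance.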

Conclusion: The results confirmed our hypothesis and suggest that Monte Carlo data can keep up with the benchmark in the applied test framework. The algorithm supports (weighted) undirected and directed graphs and might be useful for assessing methods for network construction.
