Topics in Data Management and Information: Short Courses of SBBD 2024
Keywords:
Retrieval-Augmented Generation, Knowledge Graphs, Execution Plans, Visual Approach, Query Exploration, Query Optimization, SQL, Pattern Mining in GraphsSynopsis
This book includes three chapters written by the authors of the selected tutorials presented during the 39th Brazilian Symposium on Databases (SBBD 2024), held from September 14 to 17, 2024. They aim to present relevant topics related to Databases. Moreover, they promote discussions on the topics' fundamentals, trends, and challenges. Each short course lasts four hours and is an excellent opportunity to update academics and professionals participating in the event.
The chapters cover content related to Retrieval-Augmented Generation (RAG), Query Exploration and Optimization, as well as Graph Pattern Mining. The short course program committee was composed of José Maria da Silva Monteiro Filho (UFC), Humberto Razente (UFU), and Ronaldo dos Santos Mello (UFSC) under the coordination of the former.
The richness of this issue can be mainly credited to the authors and reviewers. We greatly thank them for their insightful contributions and discussions during SBBD 2024.
Chapters
-
1. Generation with Retrieval-Augmented Generation (RAG) in Knowledge Graphs
-
2. Unveiling Execution Plans: A Visual Approach for Query Exploration and Optimization
-
3. Practical Graph Pattern Mining: Systems, Applications, and Challenges
Downloads
References
Agrawal, M., Zitnik, M., and Leskovec, J. (2018). Large-scale analysis of disease pathways in the human interactome. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, 23:111–122.
Allemang, D. and Hendler, J. (2011). Semantic web for the working ontologist: effective modeling in RDFS and OWL. Elsevier.
Angles, R. and Gutierrez, C. (2008). Survey of graph database models. ACM Computing Surveys (CSUR), 40(1):1–39.
Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J., and Vrgo ˇc, D. (2017). Foundations of modern query languages for graph databases. ACM Computing Surveys (CSUR), 50(5):1–40.
Aquino, S. B. F. (2023). Strategies for efficient subgraph enumeration on GPUs. Phd thesis, Federal University of Minas Gerais. Available at [link].
Benson, A. R., Gleich, D. F., and Leskovec, J. (2016). Higher-order organization of complex networks. Science.
Bertino, E., Ooi, B. C., Sacks-Davis, R., Tan, K.-L., Zobel, J., Shidlovsky, B., and Andronico, D. (2012). Indexing techniques for advanced database systems, volume 8. Springer Science & Business Media.
Bindschaedler, L., Malicevic, J., Lepers, B., Goel, A., and Zwaenepoel, W. (2021). Tesseract: Distributed, General Graph Pattern Mining on Evolving Graphs, page 458–473. Association for Computing Machinery, New York, NY, USA.
Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. Advances in neural information processing systems, 26.
Borgelt, C. (2007). Canonical forms for frequent graph mining. In Decker, R. and Lenz, H. J., editors, Advances in Data Analysis, pages 337–349, Berlin, Heidelberg. Springer Berlin Heidelberg.
Bringmann, B. and Nijssen, S. (2008). What is frequent in a single graph? In Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD’08, pages 858–863, Berlin, Heidelberg. Springer-Verlag.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
Buehrer, G. and Chellapilla, K. (2008). A scalable pattern mining approach to web graph compression with communities. In Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM ’08, page 95–106, New York, NY, USA. Association for Computing Machinery.
Chamberlin, D. D. (2012). Early history of sql. IEEE Annals of the History of Computing, 34(4):78–82.
Chen, X. and Arvind (2022). Efficient and scalable graph pattern mining on GPUs. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 857–877, Carlsbad, CA. USENIX Association.
Chen, X., Dathathri, R., Gill, G., and Pingali, K. (2020). Pangolin: An efficient and flexible graph mining system on cpu and gpu. Proc. VLDB Endow., 13(8):1190–1205.
Corporation, N. (2024). NVIDIA Website. [link]. [Online; accessed 5-August-2024].
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dias, V., Teixeira, C. H. C., Guedes, D., Meira Jr., W., and Parthasarathy, S. (2019). Fractal: A general-purpose graph pattern mining system. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD).
dos Santos Dias, V. V. (2023). Graph pattern mining: consolidating models, systems, and abstractions. Phd thesis, Federal University of Minas Gerais. Available at [link].
Dourisboure, Y., Geraci, F., and Pellegrini, M. (2009). Extraction and classification of dense implicit communities in the web graph. ACM Trans. Web, 3(2).
Elbassuoni, S. and Blanco, R. (2011). Keyword search over rdf graphs. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, pages 237–242, New York, NY, USA. ACM.
Elseidy, M., Abdelhamid, E., Skiadopoulos, S., and Kalnis, P. (2014). Grami: Frequent subgraph and pattern mining in a single large graph. Proc. VLDB Endow., 7(7):517–528.
Fan, W. (2022). Big graphs: Challenges and opportunities. Proc. VLDB Endow., 15(12):3782–3797.
Ferraz, S., Dias, V., Teixeira, C. H., Parthasarathy, S., Teodoro, G., and Meira, W. (2024). Dumato: An efficient warp-centric subgraph enumeration system for gpu. Journal of Parallel and Distributed Computing, 191:104903.
Ferraz, S., Dias, V., Teixeira, C. H., Teodoro, G., and Meira, W. (2022). Efficient strategies for graph pattern mining algorithms on gpus. In 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 110–119.
Gabbrielli, M. and Martini, S. (2023). Functional programming paradigm. In Programming Languages: Principles and Paradigms, pages 335–368. Springer.
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., and Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.
Graefe, G. (1994). Volcano/spl minus/an extensible and parallel query evaluation system. IEEE Transactions on Knowledge and Data Engineering, 6(1):120–135.
Hammer, J. and Schneider, M. (2018). Data structures for databases. In Handbook of Data Structures and Applications, pages 967–981. Chapman and Hall/CRC.
Henderson, M., Al-Rfou, R., Strope, B., Sung, Y.-H., Lukács, L., Guo, R., Kumar, S., Miklos, B., and Kurzweil, R. (2017). Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652.
Hoffman, F. and Krasle, D. (2015). Fraud detection using network analysis. Patent No. EP2884418A1, Filed September 1st., 2014, Issued June 17th., 2015.
Hogan, A., Blomqvist, E., Cochez, M., D’amato, C., Melo, G. D., Gutierrez, C., Kirrane, S., Gayo, J. E. L., Navigli, R., Neumaier, S., Ngomo, A.-C. N., Polleres, A., Rashid, S. M., Rula, A., Schmelzeisen, L., Sequeda, J., Staab, S., and Zimmermann, A. (2021). Knowledge graphs. ACM Comput. Surv., 54(4).
Hong, S., Kim, S. K., Oguntebi, T., and Olukotun, K. (2011). Accelerating cuda graph algorithms at maximum warp. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP ’11, page 267–276, New York, NY, USA. Association for Computing Machinery.
Hooi, B., Shin, K., Lamba, H., and Faloutsos, C. (2020). Telltail: Fast scoring and detection of dense subgraphs. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):4150–4157.
Huan, J., Wang, W., and Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphism. In Proceedings of the Third IEEE International Conference on Data Mining, ICDM ’03, pages 549–, Washington, DC, USA. IEEE Computer Society.
Jahangiri, S., Carey, M., and Freytag, C. (2022). Design trade-offs for a robust dynamic hybrid hash join. Proceedings of the VLDB Endowment, 15(10).
Jamshidi, K., Mahadasa, R., and Vora, K. (2020). Peregrine: A pattern-aware graph mining system. In Proceedings of the Fifteenth European Conference on Computer Systems, EuroSys ’20, New York, NY, USA. Association for Computing Machinery.
Jin, R., Xiang, Y., Ruan, N., and Fuhry, D. (2009). 3-hop: a high-compression indexing scheme for reachability query. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD ’09, page 813–826, New York, NY, USA. Association for Computing Machinery.
Junttila, T. and Kaski, P. (2007). Engineering an efficient canonical labeling tool for large and sparse graphs. In Proceedings of the Meeting on Algorithm Engineering & Expermiments, pages 135–149, Philadelphia, PA, USA. Society for Industrial and Applied Mathematics.
Kipf, T. N. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks.
Kriege, N. and Mutzel, P. (2012). Subgraph matching kernels for attributed graphs. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, page 291–298, Madison, WI, USA. Omnipress.
Kuramochi, M. and Karypis, G. (2005). Finding frequent patterns in a large sparse graph*. Data Min. Knowl. Discov., 11(3):243–271.
Lan, H., Bao, Z., and Peng, Y. (2021). A survey on advancing the dbms query optimizer: Cardinality estimation, cost model, and plan enumeration. Data Science and Engineering, 6:86–101.
Leon-Suematsu, Y. I., Inui, K., Kurohashi, S., and Kidawara, Y. (2011). Web Spam Detection by Exploring Densely Connected Subgraphs. In 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, volume 1, pages 124–129.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
Manning, C. D. (2008). Introduction to information retrieval. Syngress Publishing,.
Manolopoulos, Y., Theodoridis, Y., Tsotras, V. J., Manolopoulos, Y., Theodoridis, Y., and Tsotras, V. J. (2000). Fundamental access methods. Advanced Database Indexing, pages 37–59.
Mawhirter, D. and Wu, B. (2019). Automine: Harmonizing high-level abstraction and high performance for graph mining. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP ’19, pages 509–523, New York, NY, USA. ACM.
Mawhirter, D., Reinehr, S., Holmes, C., Liu, T., and Wu, B. (2021). Graphzero: A high-performance subgraph matching system. SIGOPS Oper. Syst. Rev., 55(1):21–37.
Meng, C., Mouli, S. C., Ribeiro, B., and Neville, J. (2018). Subgraph pattern neural networks for high-order graph evolution prediction.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Miller, J. J. (2013). Graph database applications and concepts with neo4j. In Proceedings of the southern association for information systems conference, Atlanta, GA, USA, volume 2324, pages 141–147.
Nickel, M., Murphy, K., Tresp, V., and Gabrilovich, E. (2016). A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33.
Pandya, K. and Holia, M. (2023). Automating customer service using langchain: Building custom open-source gpt chatbot for organizations. arXiv preprint arXiv:2310.05421.
Paulheim, H. (2017). Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic web, 8(3):489–508.
Pennington, J., Socher, R., and Manning, C. D. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
Pržulj, N., Corneil, D. G., and Jurisica, I. (2004). Modeling interactome: scale-free or geometric? Bioinformatics, 20(18):3508–3515.
Qin, H., Li, R.-H., Wang, G., Qin, L., Cheng, Y., and Yuan, Y. (2019). Mining periodic cliques in temporal networks. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1130–1141.
Ribeiro, P., Paredes, P., Silva, M. E. P., Aparicio, D., and Silva, F. (2021). A survey on subgraph counting: Concepts, algorithms, and applications to network motifs and graphlets. ACM Comput. Surv., 54(2).
Robinson, I.,Webber, J., and Eifrem, E. (2015). Graph databases: new opportunities for connected data. " O’Reilly Media, Inc.".
Rogers, A., Kovaleva, O., and Rumshisky, A. (2021). A primer in bertology: What we know about how bert works. Transactions of the Association for Computational Linguistics, 8:842–866.
Suchanek, F. M., Kasneci, G., and Weikum, G. (2007). Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, page 697–706, New York, NY, USA. Association for Computing Machinery.
Sumathi, S. and Esakkirajan, S. (2007). Fundamentals of relational database management systems, volume 47. Springer Science & Business Media.
Sun, X., Cheng, H., Li, J., Liu, B., and Guan, J. (2023). All in one: Multi-task prompting for graph neural networks. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, page 2120–2131, New York, NY, USA. Association for Computing Machinery.
Taipalus, T. (2020). The effects of database complexity on sql query formulation. Journal of Systems and Software, 165:110576.
Teixeira, C. H. C., Fonseca, A. J., Serafini, M., Siganos, G., Zaki, M. J., and Aboulnaga, A. (2015). Arabesque: a system for distributed graph mining. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP ’15. ACM.
Ugander, J., Backstrom, L., and Kleinberg, J. (2013). Subgraph frequencies: mapping the empirical and extremal geography of large graph collections. In WWW.
Wang, K., Zuo, Z., Thorpe, J., Nguyen, T. Q., and Xu, G. H. (2018). Rstream: Marrying relational algebra with streaming for efficient graph mining on a single machine. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’18, pages 763–782, Berkeley, CA, USA. USENIX Association.
Yan, X. and Han, J. (2002). gspan: Graph-based substructure pattern mining. In Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM ’02, pages 721–, Washington, DC, USA. IEEE Computer Society.
Yang, Y., Yan, D., Wu, H., Cheng, J., Zhou, S., and Lui, J. C. (2016). Diversified temporal subgraph pattern mining. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, page 1965–1974, New York, NY, USA. Association for Computing Machinery.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI.
Zhang, P. (2017). Practical Guide for Oracle SQL, T-SQL and MySQL. Crc press.
Zhao, H., Zhou, Y., Song, Y., and Lee, D. L. (2019). Motif enhanced recommendation over heterogeneous information network. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, pages 2189–2192, New York, NY, USA. ACM.
Zou, X. (2020). A survey on application of knowledge graph. In Journal of Physics: Conference Series, volume 1487, page 012016. IOP Publishing.