Contextualized Text Representation Using Latent Topics for Classifying Scientific Papers

Moosaviyan, Maryam; Ghayoomi, Masood

doi:10.22051/jlr.2023.44640.2331

Zabanpazhuhi (Journal of Language Research)
Article 2, Volume 15, Issue 49, Esfand 1402, Pages 31-60
Article type: Research article
Authors
Maryam Moosaviyan¹; Masood Ghayoomi*²
¹Computer Engineering Department, Faculty of Engineering, Amirkabir University of Technology, Tehran, Iran
²Faculty of Linguistics, Institute for Humanities and Cultural Studies, Tehran, Iran
Abstract
Every year, researchers across scientific fields publish their findings as technical reports or as articles in proceedings and journals. Search engines and digital libraries collect this material to support searching and accessing research publications, but they usually retrieve related articles by matching query keywords rather than the articles' topics. Accurate classification of scientific articles can therefore improve the quality of users' searches for scientific documents in databases. The main aim of this paper is to present a classification model that determines the topic of scientific articles. To this end, we propose a model that uses enriched contextual knowledge of Persian articles based on distributional semantics. Identifying the specific field of each document and delimiting its domain with this enriched knowledge increases the accuracy of scientific article classification. To reach this goal, we enrich contextualized embedding models, either ParsBERT or XLM-RoBERTa, with the latent topics of the articles and use the result to train a multilayer perceptron. In our experiments, ParsBERT-NMF-1HT achieved an overall F-measure of 72.37% (macro) and 75.21% (micro), a statistically significant difference from the baseline (p<0.05).
Keywords
Article Content Analysis; Contextualized Representation; Distributional Semantics; Neural Network; Scientific Article Classification; Topic Modeling
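
The abstract reports both macro- and micro-averaged F-measure. As a minimal sketch of how the two averages differ, assuming single-label classification and the common definition of macro-F as the unweighted mean of per-class F1 (the article may define it differently), the Python function below computes both. The function name f1_scores and the toy labels are illustrative and are not taken from the article.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # predicted class p wrongly
            fn[t] += 1   # missed the true class t

    def f1(tp_, fp_, fn_):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 values, so every class counts equally
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so every document counts equally
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy labels, invented for illustration only
y_true = ["physics", "physics", "physics", "physics", "law", "art"]
y_pred = ["physics", "physics", "physics", "law", "law", "physics"]
macro, micro = f1_scores(y_true, y_pred)
print(f"macro-F1 = {macro:.4f}, micro-F1 = {micro:.4f}")
```

On this toy data the frequent class dominates the pooled counts, so micro-F1 (about 0.667) is higher than macro-F1 (about 0.472), which is penalized by the completely missed rare class.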
