Building a Specialized Comparable Corpus: PARSA

Alayiaboozar, Elham; Hojjatpanah, Aliasghar

doi:10.22051/jlr.2023.44928.2348

Zabanpazhuhi (Journal of Language Research)
Article 8, Volume 16, Issue 52, Mehr 1403 (September-October 2024), pages 219-246
Article type: Research article
Digital Object Identifier (DOI): 10.22051/jlr.2023.44928.2348
Authors
Elham Alayiaboozar*¹; Aliasghar Hojjatpanah²
¹Assistant Professor, Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran
²Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran
Abstract
Corpora are grouped as monolingual, bilingual, or multilingual according to the language of their constituent texts. A comparable corpus is a bilingual or multilingual corpus containing similar texts from the same subject areas. Although such corpora are widely used in linguistic research, machine translation, and automatic cross-lingual information retrieval systems, researchers have always faced a shortage of comparable corpora. This article describes the steps taken to build a specialized comparable corpus named PARSA. The corpus was built from the Persian and English abstracts of theses and dissertations registered at the Iranian Research Institute for Information Science and Technology (IranDoc) and contains more than 89 million Persian words and 79 million English words. Its content is not general: it consists of highly specialized texts in broad subject areas such as social sciences, humanities and arts, and engineering and technology, together with the disciplines belonging to those areas, which makes it very valuable for language processing tasks that require specialized texts. To build the corpus, after sampling, the Persian data underwent preprocessing (normalization and tokenization). To evaluate this step, precision (P), recall (R), and F1 were measured: precision was 0.5614035088, recall was 0.0531561462, and F1 was 0.09711684370257966. The data were then part-of-speech tagged, and the tags of the Persian texts were checked. The English data were also tagged automatically. The Persian data contain 57,653,813 content words (verbs, nouns, adjectives, adverbs) and 31,350,125 function words together with numbers and punctuation marks, and 41,064 Persian lemmas were extracted. The English texts contain 45,606,686 content words and 33,662,304 function words together with numbers and punctuation marks, and 12,937 English lemmas were extracted. The resulting corpus is well suited to data mining, machine translation research, and any research carried out on scientific texts.
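The evaluation figures and word counts reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the abstract does not spell out the formula, and the function and variable names here are illustrative only, not taken from the article.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Preprocessing evaluation values as reported in the abstract.
P = 0.5614035088
R = 0.0531561462
REPORTED_F1 = 0.09711684370257966

# Assumption: F1 is the standard harmonic mean of P and R.
f1 = f1_score(P, R)

# Token counts as reported in the abstract:
# content words + (function words, numbers, punctuation marks).
persian_total = 57_653_813 + 31_350_125
english_total = 45_606_686 + 33_662_304

print(f"F1 recomputed from P and R: {f1:.6f}")
print(f"Persian tokens: {persian_total:,}")
print(f"English tokens: {english_total:,}")
```

Under that assumption the recomputed F1 agrees with the reported value, and the two totals (89,003,938 and 79,268,990) match the "more than 89 million" Persian and "79 million" English words stated above.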
Keywords
specialized corpus; comparable corpus; normalization; tokenization; tagging
Article title [English]
Building a specialized comparable corpus: PARSA
Authors [English]
Elham Alayiaboozar¹; Aliasghar Hojjatpanah²
¹Assistant Professor, Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran
²Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran
Abstract [English]
Based on the language used in their constituent texts, corpora are categorized as monolingual, bilingual, or multilingual. A comparable corpus is a bilingual or multilingual corpus that includes similar texts in the same subject areas. In other words, a comparable corpus is a collection of documents in two or more languages that cover similar topics. Comparable corpora can be composed of general texts, providing various possibilities for discourse analysis, pragmatics, analysis of text genres, and sociolinguistics. Examples of such corpora include collections of encyclopedia entries or literary texts from a certain period of time. However, the most common comparable corpora, and those that attract the widest audience, are related to specialized fields and contain a high density of technical vocabulary and terms. Such a corpus is called a specialized comparable corpus. In this study, a specialized comparable corpus was built from the Persian and English abstracts of theses and dissertations registered in IranDoc. The corpus is named PARSA.
Keywords [English]
specialized corpus, comparable corpus, normalization, tokenization, tagging