The Multilingual Corpus Extractor (MCE) is a cutting-edge web-based application developed by the Corpus Research Center (CRC) at Minhaj University Lahore. It provides researchers, linguists, teachers, and students with a robust and user-friendly platform for extracting textual data from webpages and preparing it for linguistic analysis and corpus development.
MCE is multilingual by design, supporting the extraction and segmentation of Urdu, Arabic, English, and mixed-language content. It intelligently highlights and classifies valid and invalid text, making it highly suitable for computational linguistics, semantic analysis, discourse studies, and educational applications.
01
Multilingual: Supports Urdu, Arabic, English, and mixed languages.
02
Highlighting: Color-coded paragraphs for semantic clarity.
03
Download: Download valid/invalid text as .txt files.
04
ZIP Export: Cleaned text segments exported as plain .txt files in a ZIP archive.
05
CSV Upload: Import URLs from a CSV file’s first column.
06
Bulk Input: Paste or type multiple URLs separated by commas or newlines.
07
Web Access: Access from any device, no install needed.
08
Data Privacy: No data is saved. Session-based only.
09
Remove: Delete unwanted text before saving.
10
Edit: Inline editing of text before export.
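For readers preparing URL lists for the CSV Upload and Bulk Input features, the two input formats can be illustrated with a minimal Python sketch. This is not MCE's own code: the function names `parse_bulk_input` and `urls_from_csv` are hypothetical, and the sketch assumes the CSV has no header row and that no URL contains a comma.

```python
import csv
import io
import re


def parse_bulk_input(text: str) -> list[str]:
    """Split pasted text on commas or newlines and drop empty entries.

    Assumption: URLs themselves contain no commas.
    """
    parts = re.split(r"[,\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def urls_from_csv(csv_text: str) -> list[str]:
    """Return the non-empty values found in the first column of a CSV.

    Assumption: the file has no header row; every row holds a URL.
    """
    reader = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in reader if row and row[0].strip()]


if __name__ == "__main__":
    pasted = "https://example.org/a, https://example.org/b\nhttps://example.org/c"
    print(parse_bulk_input(pasted))

    sheet = "https://example.org/a,news\nhttps://example.org/b,blog\n"
    print(urls_from_csv(sheet))
```

Checking a list this way before submitting it can catch blank rows or stray separators that would otherwise be read as empty URLs.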
Corpus Research Center. (2025). Multilingual Corpus Extractor (MCE) [Web-based tool]. Minhaj University Lahore.
The MCE is a digital tool and may exhibit minor inaccuracies due to the variability in web content formatting. Users are encouraged to verify and preprocess the extracted data before academic use.
This tool is the intellectual property of the Corpus Research Center (CRC), Minhaj University Lahore. All rights reserved. Unauthorized replication, redistribution, or commercialization of the tool or its components is strictly prohibited.
For inquiries or permissions, contact: admin.crc@mul.edu.pk