Arabic NLP.
Practical.
Five days of building things in Arabic that actually work in production.
Who it's for
Engineers building Arabic-facing systems.
ML engineers and product engineers shipping Arabic search, chatbots, document intelligence, or translation.
Five-day syllabus
Build it the right way.
DAY01
Arabic linguistic foundations for engineers
- Script properties: ligatures, shaping, diacritics, kashida — and what your text editor lies about
- Morphology: roots, patterns, clitics — why English-style stemming destroys Arabic
- MSA vs Gulf, Levantine, Egyptian, Maghrebi — and the dialect-detection problem
- Unicode pitfalls: Alef variants, Yaa/Alef Maqsura, Hamza positions, BiDi
Lab: Build a robust Arabic text normaliser (NFC, dediacritisation, Alef/Yaa unification, tatweel removal) with property-based tests against edge cases.
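As a rough illustration of the steps named in this lab, here is a minimal sketch of such a normaliser. It is not the course's reference solution. The function name `normalise`, the exact set of code points stripped, and the order of the steps are assumptions made for this example; a production normaliser would likely make each step configurable.

```python
import re
import unicodedata

# Assumed scope: harakat, tanween, shadda, sukun (U+064B..U+0652)
# plus superscript alef (U+0670). Other marks are left untouched.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # kashida / elongation character

# Alef variants folded to bare alef (U+0627).
_ALEF_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0671": "\u0627",  # alef wasla
})

# Alef maqsura (U+0649) folded to yaa (U+064A).
_YAA_MAP = str.maketrans({"\u0649": "\u064A"})


def normalise(text: str) -> str:
    """Apply NFC, then strip diacritics and tatweel, then unify Alef and Yaa."""
    text = unicodedata.normalize("NFC", text)  # compose e.g. alef + combining hamza
    text = _DIACRITICS.sub("", text)           # dediacritisation
    text = text.replace(_TATWEEL, "")          # tatweel removal
    text = text.translate(_ALEF_MAP)           # Alef unification
    text = text.translate(_YAA_MAP)            # Yaa unification
    return text


if __name__ == "__main__":
    print(normalise("\u0623\u064E\u062D\u0652\u0645\u064E\u062F"))
```

One edge case worth a property-based test: a tatweel sitting between an alef and a combining hamza blocks NFC composition, so removing the tatweel afterwards can leave a sequence that a second pass would change. This sketch is therefore not guaranteed to be idempotent on arbitrary input.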
DAY02
Tokenisation & embeddings
- BPE, WordPiece, SentencePiece on Arabic — token explosion and what to do about it
- Arabic-aware tokenisers (AraBERT, MARBERT, AraGPT2, Jais): strengths and gotchas
- Cross-lingual embeddings: when bilingual retrieval works, when it fails
- Code-switching (Arabizi, Franco-Arabic) and Latin-script Arabic
Lab: Train a SentencePiece tokeniser on a provided 2 GB Arabic corpus. Compare token efficiency against three off-the-shelf tokenisers on news, dialect, and code-switched samples.
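One common way to express "token efficiency" is fertility: the average number of tokens produced per whitespace-delimited word. The sketch below assumes that definition, which the syllabus does not specify. The names `fertility` and `compare` are hypothetical, and the two stand-in tokenisers exist only to keep the example self-contained; in the lab, any callable that maps a string to a list of tokens could be plugged in instead.

```python
from typing import Callable, Dict, Iterable, List

Tokenise = Callable[[str], List[str]]


def fertility(tokenise: Tokenise, texts: Iterable[str]) -> float:
    """Average tokens per whitespace-delimited word (lower is more efficient)."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenise(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in the supplied texts")
    return n_tokens / n_words


def compare(
    tokenisers: Dict[str, Tokenise],
    samples: Dict[str, List[str]],
) -> Dict[str, Dict[str, float]]:
    """Fertility of every tokeniser on every sample set (e.g. news, dialect)."""
    return {
        name: {domain: fertility(tok, texts) for domain, texts in samples.items()}
        for name, tok in tokenisers.items()
    }


# Stand-in tokenisers for illustration only: the two extremes of granularity.
def whitespace_tokenise(text: str) -> List[str]:
    return text.split()


def char_tokenise(text: str) -> List[str]:
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    samples = {"toy": ["ab cd", "efg"]}
    tokenisers = {"whitespace": whitespace_tokenise, "char": char_tokenise}
    print(compare(tokenisers, samples))
```

Whitespace word counts are a crude denominator for Arabic, where clitics attach to the word; normalising by characters or bytes instead is a reasonable alternative to report alongside.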
DAY03
Classification, NER & retrieval
- Fine-tuning Arabic encoders for classification — sentiment, topic, intent
- NER for Arabic: named entities, place names, mixed-script handling
- Semantic search with Arabic embedding models, hybrid + reranking
- Cross-lingual retrieval: Arabic query → English documents and vice versa
Lab: Build an NER pipeline for Arabic news. Annotate 200 sentences as a golden set; fine-tune a model; report per-entity F1 and analyse failures by dialect.
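For the reporting step, per-entity F1 can be computed directly from gold and predicted spans. The sketch below assumes exact-match scoring, where a prediction counts only if start, end, and label all agree with the golden set. That is one common convention, not necessarily the one used in the lab. The `Span` layout and the function name `per_entity_f1` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Assumed span layout: (start_token, end_token, label).
Span = Tuple[int, int, str]


def per_entity_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> Dict[str, float]:
    """Exact-match F1 per entity label, over parallel lists of sentences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same sentences")
    tp: Counter = Counter()
    fp: Counter = Counter()
    fn: Counter = Counter()
    for gold_spans, pred_spans in zip(gold, pred):
        for span in pred_spans:
            if span in gold_spans:
                tp[span[2]] += 1   # correct span and label
            else:
                fp[span[2]] += 1   # predicted but not in the golden set
        for span in gold_spans - pred_spans:
            fn[span[2]] += 1       # in the golden set but missed
    scores: Dict[str, float] = {}
    for label in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    gold = [{(0, 1, "PER"), (3, 4, "LOC")}]
    pred = [{(0, 1, "PER"), (3, 5, "LOC")}]
    print(per_entity_f1(gold, pred))
```

To analyse failures by dialect, the same function can be run separately on the subset of sentences tagged with each dialect.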
DAY04
Generation, translation & RTL UX
- Arabic generative models — Jais, AceGPT, and prompting in Arabic
- Translation pipelines: NLLB, custom MT, post-editing patterns
- Bidirectional UI: numbers in bidirectional text, embedded English, mixed punctuation
- Right-to-left rendering across web, PDF, and chat surfaces
Lab: Build an Arabic chatbot over a domain corpus, tested against an MSA-only evaluation set and a Gulf-dialect evaluation set. Quantify the dialect gap.
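One simple way to quantify the dialect gap is to score both evaluation sets with the same metric and report the absolute and relative difference. The sketch below assumes each evaluation item has already been judged correct or incorrect, which is a simplification for chatbot output. The function names `accuracy` and `dialect_gap` are hypothetical.

```python
from typing import Dict, Sequence


def accuracy(judgements: Sequence[bool]) -> float:
    """Fraction of evaluation items judged correct."""
    if not judgements:
        raise ValueError("empty evaluation set")
    return sum(judgements) / len(judgements)


def dialect_gap(msa: Sequence[bool], dialect: Sequence[bool]) -> Dict[str, float]:
    """Absolute and relative drop from the MSA score to the dialect score."""
    msa_acc = accuracy(msa)
    dialect_acc = accuracy(dialect)
    absolute = msa_acc - dialect_acc
    # Relative gap is undefined when the MSA score is zero.
    relative = absolute / msa_acc if msa_acc else float("nan")
    return {
        "msa": msa_acc,
        "dialect": dialect_acc,
        "absolute_gap": absolute,
        "relative_gap": relative,
    }


if __name__ == "__main__":
    msa_results = [True, True, True, False]
    gulf_results = [True, False, False, True]
    print(dialect_gap(msa_results, gulf_results))
```

With evaluation sets of a few hundred items, a point estimate alone can mislead; a bootstrap confidence interval over the items is a natural next step.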
DAY05
Evaluation, deployment, and capstone
- Arabic-specific evaluation: BLEU/chrF caveats, COMET-style judges, human review protocols
- Bias and dialect coverage — the failures regulators will eventually ask about
- Deployment patterns specific to Arabic-heavy products
- Data licensing and Arabic corpus realities — what you can and cannot use
Capstone: Take a real Arabic NLP problem from your team and present an end-to-end design covering the normaliser, tokeniser, model, evaluation set, and a one-page risk register.
Lead instructor
Shady Ali
Co-Director · Sadiqoon Technologies
Has built Arabic NLP systems in production for archive, banking, and healthcare clients. Native Arabic speaker.
What's included
Inclusions.
- 35 hours of live instructor-led tuition
- Lab GPU environment + Arabic corpora pack
- Bilingual workbook (English + Arabic) + code repository
- Daily lunch & refreshments; cohort dinner Day 4
- UK visa support letter on confirmed registration
- 30 days of post-course Q&A access
- Sadiqoon Institute Certificate of Completion
- Curated Arabic dataset list (open-licence) for ongoing work