From structure to meaning: a lexical semantic framework for Urdu compounding

Abstract

Urdu, a rich Indo-Aryan language, relies extensively on derivational and inflectional processes for lexical expansion. Compounding, a pivotal word-formation process, has received limited scholarly attention despite its central role in Urdu’s linguistic complexity. This study investigates compounding in Urdu by applying Lieber’s Lexical Semantic Framework (LSF) to examine its semantic and morphological dimensions. Using a qualitative descriptive design, the study analyzes 30 purposively sampled compounds from The Express newspaper and the Feroz-ul-Lughat dictionary, representing prevalent morphological patterns such as noun-noun (N + N), noun-adjective (N + Adj), and noun-verb (N + V) structures. The findings demonstrate LSF’s adaptability to Urdu, uncovering transparency and opacity in semantic relationships. A unique pattern of argumental compounding emerges, where constituent elements interact to create culturally resonant meanings. Furthermore, the analysis reveals compound-specific innovations in Urdu that diverge from conventional typologies and enrich the theoretical understanding of lexical semantics. These findings have significant implications for natural language processing (NLP), especially in enhancing machine translation and text analysis tools for Urdu. This study contributes to the broader linguistic discourse by showcasing the complex interaction of morphology and semantics in Urdu, while also providing a methodological model for analyzing resource-poor languages. Future research could explore the role of sociolinguistic factors and regional influences in compounding processes, deepening the understanding of word formation in Urdu and related languages.

Introduction

Urdu, like many South Asian languages, thrives on the creative fusion of words to express complex ideas, emotions, and cultural values. Consider the compound word /dʒeːb χəɾtʃ/, which literally translates to “pocket-expense” but functionally means pocket money. This compound elegantly combines two nouns—/dʒeːb/ meaning pocket and /χəɾtʃ/ meaning expense or spending—to produce a culturally resonant term that reflects a common socio-economic practice in Urdu-speaking communities. Such compounds go beyond literal combinations; they encapsulate complex social practices, contextual meanings, and pragmatic usage that are deeply embedded in cultural discourse (Nabi et al., 2025). This example illustrates why compounding is critical in Urdu linguistics: It functions as a morphological process and simultaneously acts as a vehicle for meaning-making that reflects cultural specificity.

Despite having over 100 million speakers worldwide, with around 60% of Pakistanis learning it as a second language (Dar et al., 2015; Dogar et al., 2024; Haroon et al., 2023), Urdu remains a resource-poor language in terms of linguistic tools and computational resources (Batool and Saleem, 2023). This paradox highlights the pressing need for detailed studies into Urdu’s morphology and lexicon to support language processing efforts. A critical area within Urdu linguistics that warrants further exploration is word formation, where processes like compounding play a central role. In linguistics, the creation of new lexical items, including those formed through compounding, falls under the purview of morphology, specifically focusing on how morphemes—the smallest meaningful units—combine to form complex words (Aziz et al., 2020).

Previous studies emphasize various morphological processes, including derivation, inflection, and borrowing, each contributing to the evolution and expansion of Urdu’s lexicon (Maqsood et al., 2018). However, research on Urdu compounding is relatively limited, despite its pivotal role in expanding the language’s expressive capacity. Compounding in Urdu manifests in diverse forms, including noun-noun, noun-adjective, and verb-verb structures, as seen in compounds like “kitab-khana” (library, lit. “book-house”) and “dekh-bhal” (supervision, lit. “look-after”). These compounds generate new meanings while capturing cultural and historical aspects through the integration of influences from languages such as Persian and Arabic (Aziz et al., 2020; Masood et al., 2018).

The existing literature indicates that Urdu compounds can vary widely in terms of semantic transparency: some compounds have meanings that are directly inferred from their parts, while others have opaque meanings, where the compound’s meaning diverges from its constituents. Understanding this semantic spectrum is crucial for enhancing natural language processing (NLP) tools, yet there remains a lack of comprehensive studies on Urdu compounding using modern lexical semantic frameworks. To address this gap, the current study employs Lieber’s (2004, 2007, 2009, 2010, 2021) Lexical Semantic Framework (LSF) to examine compounding in Urdu, focusing on two primary concerns: (1) the degree of semantic transparency in Urdu compounds, and (2) the influence of constituent relationships on the observed opacity.

By systematically analyzing semantic patterns and contextual factors in Urdu compounds, this study contributes to NLP applications like machine translation, which face challenges due to Urdu’s underexplored morphology (Butt and King, 2007; Ambreen and To, 2024). This research aims to deepen our understanding of Urdu compounding while laying the groundwork for developing linguistic resources essential to advancing Urdu language processing.

Statement of the problem

Urdu’s complex structure, with its rich derivational and inflectional processes, poses significant challenges in linguistic analysis, particularly in the area of computational language processing. A notable gap exists in understanding how compounding—one of the primary word-formation methods in Urdu—fits within established linguistic frameworks. Compounds, which involve merging two or more roots to form a single, meaningful expression, often feature complex semantic relationships and syntactic structures that are difficult to categorize. This study tackles this challenge by employing Lieber’s (2004) LSF, aiming to examine its effectiveness in deciphering the nuanced semantics of Urdu compound words. Furthermore, it seeks to identify, categorize, and compare Urdu’s unique compound types with Lieber’s typology. This investigation evaluates the adaptability of the LSF to Urdu and highlights distinctive compounding patterns within the language, including noun-noun (N + N), noun-adjective (N + Adj), and noun-verb (N + V) combinations. These patterns differ from traditional categories and contribute to a deeper understanding of Urdu’s lexical structure while also addressing challenges in Urdu language processing.

Research questions

This study is guided by the following research questions:

What are the primary semantic patterns observed in Urdu compound words, and how do they reflect the language’s cultural and morphological complexity?
How effectively does Lieber’s LSF capture the structural and semantic complexities of Urdu compounding?

Literature review

Linguistic research has long examined the interplay of semantic and morphological processes in word formation, particularly within complex languages like Urdu. As an Indo-Aryan language with intricate inflectional and derivational mechanisms, Urdu presents a valuable case for studying compounding as a primary word-formation process (Khan et al., 2025; King, 2008; Noreen et al., 2024). Compounding allows Urdu to generate nuanced, culturally resonant meanings, integrating native roots with influences from Persian and Arabic. While previous studies have explored this process, many analyses have focused on surface-level features, such as phonological patterns and morphological adjustments, without delving into the complex interaction between constituents’ meanings and the overall compound semantics (Daud et al., 2017).

Cognitive frameworks have sought to bridge this gap by analyzing conceptual integration in Urdu compounds. Saeed (2015) applies blending theory to describe how metaphorical and metonymic processes yield new, blended meanings, while Daud et al. (2017) use image schemas from Lakoff and Johnson (1980) to uncover underlying conceptual structures in compounds. However, these cognitive approaches, although insightful, lack formal rigor, limiting their capacity to capture the full semantic and relational diversity within Urdu compounds. This limitation signals a need for models that provide structured, reliable insights into the semantic relationships and dependencies characterizing Urdu compounding.

This study employs Lieber’s (2004) LSF to examine Urdu compounds through a structured semantic lens to address these gaps. LSF offers a comprehensive approach by decomposing lexical items into constituent parts, emphasizing their interaction across morphological boundaries (Andreou, 2017; Lieber and Baayen, 1999). Key principles of LSF include the decompositional nature of compounds, lexical capacity beyond mere reference, cross-categorical applicability, and a focus on foundational lexeme meaning. These components allow LSF to address phenomena like polysemy, zero derivation, and form-meaning mismatches within complex word structures, providing critical insights into compounding as a morphological process (Hamawand, 2024).

This study builds on LSF and further incorporates Distributional Semantics (DS) and Conceptual Dependency Theory (CDT) to enhance its interpretive capacity. DS uses statistical analysis to examine semantic similarities across Urdu compounds, identifying relationships such as synonymy or antonymy that enhance our understanding of compound formation. CDT, on the other hand, provides a formal structure for the conceptual dependencies among compound constituents, mapping thematic roles and causal links crucial to understanding compound semantics (Schank, 2019). By combining LSF, DS, and CDT, this study moves beyond descriptive analysis, adopting a multi-layered approach that critically assesses Urdu compounding through semantic coherence, constituent interaction, and conceptual integration. In doing so, this research aims to advance the analysis of Urdu compounding by applying LSF rigorously and addressing the limitations of previous studies. By situating Urdu compounds within a well-defined analytical model, the study enriches the understanding of Urdu’s word-formation processes and provides valuable insights for computational linguistics and natural language processing in this understudied language.

Theoretical framework

The LSF, as established by Lieber (2004), examines the mechanisms underpinning the semantic composition of complex lexical units. This domain encompasses affixes, compounds, polysemous words, derivational entities, and multi-affixed formations, among others. Within this framework, Lieber’s 1992 investigation into the deconstruction of morphological elements illuminates the interplay between syntactic rules and the generative approach to morphology. Essentially, LSF delves into the semantic foundations of lexemes (both base forms and affixes), elucidating how their interaction contributes to the overall meaning of a complex word.

Moreover, LSF offers a comprehensive approach to understanding the construction of meaning within word-formation processes. Its decompositional nature emphasizes the systematic disassembly of complex lexical units into their constituent parts, allowing for a detailed analysis of semantic composition (Lieber, 2004). LSF acknowledges the inherent semantic potential of lexical entries, extending beyond mere reference to encompass the broader capacity of words to convey meaning (Scalise et al., 2009). The framework’s versatility is demonstrated through its ability to analyze multiple categories, transcending traditional boundaries between nouns, verbs, adjectives, and other categories, facilitating an integrated understanding of how these semantic properties interact (Wechsler, 2020). By focusing on the role of basic lexical units, LSF enhances comprehension of how simple words contribute to forming more complex terms, highlighting the nuanced relationship between form and meaning (Andreou, 2017). A central objective of LSF is to provide a systematic characterization of the contributions made by individual lexemes and affixes during word formation, offering a deeper understanding of their integration (Lieber, 2004). This approach also addresses key linguistic phenomena such as polysemy, exploring how derivational affixes can exhibit multiple meanings, such as how “-ness” can denote both quality and state (Pustejovsky and Bouillon, 1995). Moreover, LSF tackles the phenomenon of zero derivation, where words like “red” and “redness” retain similar meanings despite affixation, providing an explanatory mechanism for these cases (Bauer, 2003). The framework also illuminates instances of form-meaning mismatches, such as in the words “de-ice” and “de-fuse,” offering tools to analyze and account for these discrepancies (Kiparsky, 2021).

Fundamental components of lexical semantic framework

The LSF identifies several key components that contribute to word meaning and structure. The semantic/grammatical skeleton serves as the foundation, outlining the core elements of a word’s meaning and syntax. It is structured hierarchically, encompassing both functions and arguments, which are integral to sentence formation (Geeraerts, 2015). Complementing this is the semantic/pragmatic body, which broadens the word’s representation by incorporating perceptual, cultural, and encyclopedic elements, thus enriching its overall meaning. LSF also distinctively employs features: it operates cross-categorially, meaning it spans across different word classes; equipollently, where multiple features contribute equally to meaning; and privatively, allowing features to either be present or absent depending on the word’s context (Goddard, 2001). A critical mechanism in this framework is co-indexation, which coordinates the arguments of various constituents within a complex word, ensuring that only those syntactically relevant remain active.

LSF has significantly influenced the fields of morphology and lexical semantics by establishing a cohesive framework for analyzing diverse word-formation processes and by enhancing theoretical clarity and consistency (Gawron, 2016). It provides precise tools for articulating the layers of meaning that emerge through morphological structures, allowing for a detailed examination of complex phenomena such as polysemy and affix interaction. While LSF excels in capturing meaning nuances, it may still require adaptation when dealing with languages with unique morphological characteristics, as its generalizability could be limited outside of widely studied languages (Meteyard and Vigliocco, 2018). Nevertheless, its structured approach fosters advancements in computational linguistics, particularly in natural language processing and machine translation, where an in-depth understanding of word formation and semantics is essential. The potential for LSF to inform algorithms that manage semantic complexity highlights its utility, though further empirical studies may be needed to validate its applicability across diverse linguistic landscapes. The complete lexical-semantic representation of a lexeme in LSF consists of two parts, the skeleton and the body, and the second part is further subdivided.

While LSF provides a structured approach to morphological-semantic analysis, it is important to acknowledge other tools that have been employed in compound analysis. Notable among these are Construction Morphology (Booij, 2010), which focuses on form-meaning pairings and productivity; Distributed Morphology (Halle and Marantz, 1993), which treats morphology as syntax-driven; and Conceptual Blending Theory (Fauconnier and Turner, 2002), which explores cognitive processes behind word formation. These models offer diverse insights, particularly in capturing productivity, syntax-semantics integration, and cognitive motivations in compounding. However, they often lack the formal semantic decomposition that LSF provides. LSF was chosen for this study precisely because of its clarity in analyzing semantic transparency, argument structure, and cross-categorial morphological behavior, all of which are crucial for understanding Urdu compounding. Moreover, while it was initially introduced in 2004, its later developments (Lieber, 2007, 2009, 2010) and continued adoption in typological research affirm its ongoing relevance. In the context of a resource-poor and under-analyzed language like Urdu, LSF offers a flexible yet formal structure that can accommodate both canonical and culturally embedded compounding phenomena.

Compounding mechanisms in the lexical semantic framework

In Lieber’s (2004, 2007) LSF, compound formation hinges on the concatenation of skeletal elements, accompanied by a process of co-indexing that operates in parallel. This approach essentially entails the systematic unification of base signs, which are the foundational elements contributing to the compound’s overall semantic structure. However, while co-indexing offers a straightforward mechanism for semantic integration, it can also obscure nuances within individual lexical items, potentially oversimplifying complex semantic relationships. This study further examines co-indexation within the three principal compound categories: argumental, coordinate, and attributive (Bauer et al., 2013), exploring whether co-indexing sufficiently accounts for the nuanced interplay of meaning unique to each category and considering its limitations in capturing more intricate semantic distinctions.

Argumental compounds

Argumental compounds form a distinctive category, marked by an inherent semantic connection between the head and non-head constituents. Synthetic compounds, such as “bus driver,” illustrate this relation: the non-head element (“bus”) functions as an argument, in this case the object, of the verbal base (“drive”) contained in the head (“driver”) (Lieber, 2010). These compounds typically follow a two-step derivational process. For instance, in “burrito assembler,” the derivation begins with the head “assembler,” where the suffix “-er” attaches to the verb “assemble” without enforcing any specific semantic constraints on the argument (Lieber, 2010). In this structure:

This process demonstrates a unique alignment of the “R” argument of the affixed “-er” with the verb’s highest argument, often the external argument. Subsequently, the “R” argument of the non-head constituent ‘burrito’ aligns with the verb’s unindexed, internal argument, resulting in a coherent object-oriented interpretation of the compound. More recent studies corroborate and refine Lieber’s framework, particularly in cross-linguistic contexts. Studies on Chinese and Japanese synthetic compounds (Wang and Zhou, 2018; Irwin, 2016) suggest that while the two-step process aligns with broader typological patterns, languages may vary in the rigidity of semantic constraints applied to the affix and the argument’s alignment with head constituents. Moreover, Gagné and Spalding (2016) argue that semantic transparency within compounds can influence their interpretability, highlighting the necessity of lexical context in understanding argumental compounds. This nuanced approach reinforces the flexibility within Lieber’s (2010) LSF, accommodating both universal and language-specific derivational patterns, which underscores the object-oriented reading as a notable but not universally applied feature of synthetic compounds.
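
The two-step co-indexation described above can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not Lieber’s notation or a published implementation: the names `Argument`, `Skeleton`, and `coindex_synthetic`, the role labels “external” and “internal,” and the index letters “i” and “j” are all hypothetical conveniences, and the skeletons are reduced to bare argument lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Argument:
    """One argument slot; `index` stays None until it is co-indexed."""
    role: str  # role labels are illustrative assumptions
    index: Optional[str] = None


@dataclass
class Skeleton:
    """A heavily simplified stand-in for an LSF skeleton (argument list only)."""
    form: str
    arguments: List[Argument] = field(default_factory=list)


def coindex_synthetic(non_head: Skeleton, affix: Skeleton,
                      verb: Skeleton) -> Dict[str, Optional[str]]:
    """Apply the two co-indexation steps for a synthetic compound."""
    # Step 1: the affix's R argument is linked to the verb's highest
    # (here: first-listed) argument.
    affix.arguments[0].index = "i"
    verb.arguments[0].index = "i"
    # Step 2: the non-head's R argument is linked to the first verb
    # argument that is still unindexed (the internal argument).
    for arg in verb.arguments:
        if arg.index is None:
            arg.index = "j"
            non_head.arguments[0].index = "j"
            break
    return {arg.role: arg.index for arg in verb.arguments}


burrito = Skeleton("burrito", [Argument("R")])
er = Skeleton("-er", [Argument("R")])
assemble = Skeleton("assemble", [Argument("external"), Argument("internal")])

links = coindex_synthetic(burrito, er, assemble)
print(links)
```

In this toy run, “-er” shares an index with the external argument of “assemble,” and “burrito” shares an index with the internal argument, which corresponds to the object-oriented reading discussed above.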

Coordinate compounds

Coordinate compounds are defined by the semantic equivalence of their constituents, each component contributing equally to the compound’s overall meaning (Bauer et al., 2013). Recent research underscores the need to consider detailed lexical-semantic representations for each member to fully capture the complexity of these structures (Sirintranon and Hsieh, 2024; Körtvélyessy and Štekauer, 2020). This approach is essential, as the close referential alignment between members enables near-identical representation, supporting the compound’s interpretive flexibility. For example, in the coordinate compound actor-author, the constituents are analyzed as follows:

The morphological structure of actor and author reveals identical skeletal characteristics, both being classified as concrete processual entities. This alignment supports the principle of co-indexation, evident in the shared ‘R’ arguments for each lexeme, as Körtvélyessy and Štekauer (2020) illustrate in similar cases. Additionally, the initial set of features within their semantic frameworks overlaps: both terms denote animate, human entities assigned specific societal functions. However, they diverge at the encyclopedic level, with ‘actor’ linked to performing arts and ‘author’ associated with literary creation. Recent studies further emphasize that while coordinate compounds share core properties, their encyclopedic distinctions play a pivotal role in their semantic interpretations (Arcodia and Basciano, 2018). By mapping these distinctions, this analysis demonstrates how coordinate compounds balance shared and distinct components, expanding our understanding of compound semantics across languages.
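
The comparison of actor and author can be sketched as a simple feature comparison. The Python below is a minimal illustration that assumes each constituent can be reduced to three hypothetical fields (“skeleton,” “body,” “encyclopedic”); the feature labels are taken from the prose description above, and the function name `compare_constituents` is an invention for this example rather than part of LSF.

```python
# Toy feature bundles based on the prose description; the dictionary
# layout and key names are assumptions made for illustration only.
actor = {
    "skeleton": ("concrete", "processual"),
    "body": {"animate", "human", "societal function"},
    "encyclopedic": {"performing arts"},
}
author = {
    "skeleton": ("concrete", "processual"),
    "body": {"animate", "human", "societal function"},
    "encyclopedic": {"literary creation"},
}


def compare_constituents(first, second):
    """Report what two constituents share and where they diverge."""
    return {
        # Identical skeletons are what allow the R arguments to be co-indexed.
        "same_skeleton": first["skeleton"] == second["skeleton"],
        # Body features present in both constituents.
        "shared_body": first["body"] & second["body"],
        # Encyclopedic knowledge that belongs to only one constituent.
        "distinct_encyclopedic": first["encyclopedic"] ^ second["encyclopedic"],
    }


result = compare_constituents(actor, author)
print(result["same_skeleton"])
print(sorted(result["shared_body"]))
print(sorted(result["distinct_encyclopedic"]))
```

The sketch separates the layer where the two lexemes coincide (skeleton and core body features) from the layer where they differ (encyclopedic knowledge), mirroring the distinction drawn in the paragraph above.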

Attributive compounds

Attributive compounding within Lieber’s LSF is frequently viewed as a prototypical category of compound formation. Unlike subordinate compounds, which establish a hierarchical relationship between the constituents, or coordinate compounds, which signify co-referential entities, attributive compounds typically function through a modifier-head structure without explicit argument structure dependencies (Lieber, 2004; Bauer, 2003). This absence of argumental or co-referential features can be observed in the example bamboo bed, where bamboo modifies the head noun bed but does not contribute any syntactic dependency.

In the LSF model, attributive compounds act as a “default” category, particularly when both constituents share the role argument R, as demonstrated by Lieber’s principle of co-indexation, though without the possibility of coordination due to their incompatible ontological features. For instance, in bamboo bed, bamboo and bed are co-indexed under argument R to denote a relational association. However, as the two constituents lack compatible body features for a coordinative interpretation (e.g., <+material> for bamboo and <+artifact> for bed), a coordinative or subordinative interpretation cannot apply (Libben et al., 2021). The meaning of bamboo bed, therefore, emerges not from syntactic structure but from pragmatic inference, guided by encyclopedic knowledge. The modifier-head relation in this case relies on an associative link where bamboo’s <function> as a material enables the inference that the bed is constructed from bamboo. Studies indicate that such compounds often require external encyclopedic knowledge to assign meaning, distinguishing attributive compounds from other types (Wechsler, 2020). Thus, the “made of” relationship in bamboo bed is pragmatically inferred rather than syntactically encoded, highlighting LSF’s flexibility in accommodating non-referential modification.
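
The “default” status of attributive compounds can be pictured as a fall-through decision. The following Python sketch is a deliberate simplification and rests on loud assumptions: the function `classify_compound` is hypothetical, “compatible body features” is crudely approximated as set equality, and the feature labels for burrito and assembler are invented for illustration.

```python
def classify_compound(head_has_open_argument, first_body, second_body):
    """Return a coarse compound category under simplifying assumptions."""
    # An unindexed argument on the head licenses an argumental reading.
    if head_has_open_argument:
        return "argumental"
    # Assumption: bodies count as compatible only if they are identical.
    if first_body == second_body:
        return "coordinate"
    # Neither condition holds, so the default category applies.
    return "attributive"


bamboo_bed = classify_compound(False, {"material"}, {"artifact"})
actor_author = classify_compound(False, {"animate", "human"}, {"animate", "human"})
# Body features for this pair are illustrative guesses, not from the source.
burrito_assembler = classify_compound(True, {"artifact"}, {"animate", "human"})

print(bamboo_bed, actor_author, burrito_assembler)
```

Under these assumptions, bamboo bed reaches the final branch because it has neither an open argument on the head nor matching body features, which is the sense in which the attributive reading is a default.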

Exocentric/endocentric compounds

The LSF proposes a unified approach to compound analysis by arguing that exocentric compounds do not require separate linguistic mechanisms. Instead, it suggests that both exocentric and endocentric compounds share a similar underlying structure, with exocentricity attributed to broader grammatical or metonymic interpretations rather than compounding-specific rules. This view is supported by Bauer (2003), who argues that the exocentric compound “birdbrain” exemplifies a PART FOR WHOLE metonymy. However, recent studies, such as Scalise et al. (2005), indicate that exocentric compounds may involve unique interpretive processes that challenge LSF’s assumption of uniformity in compound structure. While LSF’s approach highlights the flexibility of lexical-semantic representations, it may oversimplify the semantic complexity in exocentric compounds. For instance, recent research on cross-linguistic variations in compounding suggests that exocentric compounds often carry culturally specific meanings that standard grammatical rules cannot fully capture (Scalise, 2005). Furthermore, cognitive studies by Gawron (2016) and Meteyard and Vigliocco (2018) show that speakers interpret exocentric compounds through associative, context-dependent inferences, which may imply a need for broader theoretical models. In essence, LSF offers a robust and principled framework for delving into the intricate tapestry of meaning woven within complex lexical items, providing valuable insights for linguists grappling with the nuanced relationship between form and function in language.

Methods

In this study, a qualitative descriptive design is employed to investigate the compounding process in Urdu. This approach enables an in-depth examination of the complex meaning-making and structural interplay within compound words, focusing on how constituent elements interact to create intricate meanings. Rather than relying on quantitative metrics, this design captures the subtleties of compounding by exploring the contextual, morphological, and semantic layers in Urdu compounds. Such a detailed approach yields insights into the linguistic and cultural factors influencing compounding, which quantitative analysis might overlook.

Data collection

The study employs a purposive sampling technique, selecting 30 compounds (see Appendix A for detailed analysis) to represent prevalent compound structures in Urdu. While the sample size is relatively modest, it includes compounds with high frequency and established usage, such as noun-noun (N + N), noun-adjective (N + Adj), and noun-verb (N + V) constructions, that illustrate patterns crucial to Urdu word-formation processes. This selection process considers representativeness by focusing on compounds with varied morphological types and usage across formal and informal registers. Purposive sampling is justified as it allows the analysis to capture a breadth of compound structures central to Urdu lexicon and discourse, as supported by linguistic studies emphasizing the importance of focused sampling in typological analysis (Bauer and Renouf, 2001; Fabb, 2017; Sharif, 2022). Moreover, the compounds were chosen to balance morphological diversity and linguistic relevance, focusing on structures with established meanings to mitigate the limitations of a smaller sample. This approach supports generalizability by examining stable and recurrent compounding patterns in Urdu, particularly in the lexical domains that are most influential in shaping the language’s structure.

Validity and reliability

To strengthen validity and reliability, this study employs source triangulation, cross-referencing data with reputable linguistic resources such as The Express newspaper and the Feroz-ul-Lughat dictionary. This approach ensures a robust foundation for data interpretation. Furthermore, the coding process incorporated peer validation to enhance interpretative consistency, following established qualitative research protocols. Peer validation involved multiple researchers independently verifying coding accuracy and interpretive coherence, aligning with best practices in linguistic research to support both rigor and reproducibility (Libben et al., 2021).

Technique of LSF-based data analysis

Each compound identified was systematically analyzed within the LSF, focusing on critical structural and semantic dimensions. The analysis began with determining the syntactic configuration, isolating constituent elements (head and modifier), and their syntactic relationships (e.g., N + N, Adj + N, Verb + N). Next, the semantic contribution of each constituent was evaluated, addressing factors like transparency, opacity, and figurative meanings. Co-indexing relationships in LSF illustrated how constituent meanings interact, revealing semantic relations such as possession, instrumentality, or locational reference. Furthermore, Lieber’s (2010) parameters for compound classification (endocentricity, exocentricity, thematic relations, and directionality) were applied to uncover the formation processes and semantic nuances within each compound. Recognizing the complexity of cultural context, the analysis also accounted for the “semantic body,” allowing for a culturally informed interpretation of Urdu compounds that extends beyond skeletal lexical structure. This holistic approach provided a nuanced understanding of compound formation in Urdu, demonstrating the adaptability of LSF for language-specific semantic analysis. To keep the manuscript concise, the analysis of all 30 compounds is provided in Appendix A. The main text focuses on the most prevalent and commonly occurring compounds that share similar structures with those listed in Appendix A.

Results and discussion

This research employed a descriptive methodology for data analysis, focusing on qualitative data gathered from primary sources. Leveraging existing linguistic resources, it probed the lexical semantics of simple and complex Urdu compounds across diverse morphological contexts. Data extraction relied on readily available primary resources in Urdu, encompassing grammar books, dictionaries, scholarly articles, and other pertinent secondary sources focusing on compound words. A systematic investigation explored the concept of the semantic skeleton and its associated meaning, aiming to extract morphological instances of compounds within Urdu morphology. The analysis revealed that Urdu compounds can be described consistently within LSF: the schema drawn for each compound reflects its grammatical skeleton, semantic body, and encyclopedic meaning, which together yield the compound’s interpretation.

Coordinative compounds

Within the LSF, coordinative compounds constitute a distinct category characterized by an intrinsic equivalence relationship between their constituent lexemes. These compounds manifest a balance of semantic weight, wherein both lexical entities contribute equally and coordinate with each other. As Lieber (2009) posits, the first lexeme acts as a referential identifier for the second within the compound structure. Before analyzing compound word formation in Urdu, it is pertinent to discuss Urdu orthography, as it gives rise to varied written forms of compounding. The lexicon of Urdu exhibits a pronounced influence from Persian and Arabic, while its orthography employs a cursive, context-sensitive Perso-Arabic script read from right to left (Butt and King, 2007). The orthography of compound word formation in Urdu reflects this explicitly. The major categories of Urdu compound words are as follows (Schmidt and Funke, 2025): a. AB compound b. A-o-B compound c. A-e-B compound d. A-al-B compound

a. AB compound: This compound comprises two individual stems that combine to form a single, unified stem (Rahman, 2006). While both stems contribute to the overall meaning, the secondary stem possesses specific emphasis or functional necessity within the compound. For example:

/aːs paːs/

[around]

b. A-o-B compound: In this type of compounding, two words are joined by the morpheme /o/ (‘و’ wao), producing a single meaningful word. This morpheme functions similarly to ‘and’ (aur) and is classified as harf-e-atf (conjunction) (Schmidt, 1999). For example:

/ɡʰəruːr-o-təkəbʊr/

[pride]

c. A-e-B compound: This type involves the combination of two stems joined by a zair (‘ِ’) under the last letter of the first stem. For example:

/sədr-e-mʊmlɪkət/

[the president]

d. A-al-B compound: This type is formed with the addition of /al/ (‘ال’ al) as a linking morpheme between two stems, reflecting the Arabic root structure. For example:

/ʊmhaːt-ʊl-moːmɪniːn/

[Mothers of the believers]
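
The four orthographic patterns above can be told apart mechanically when a compound is given in a hyphenated transcription like the examples in this section. The following is a minimal sketch rather than a tool used in the study: the function name `classify_compound` and the linker table are hypothetical, and treating "ul" and "ʊl" as surface variants of the /al/ linker is an assumption based on the transcription /ʊmhaːt-ʊl-moːmɪniːn/.

```python
import re

# Hypothetical linker inventory based on the four patterns listed above.
# Treating "ul"/"ʊl" as variants of /al/ is an assumption drawn from the
# transcription /ʊmhaːt-ʊl-moːmɪniːn/.
LINKERS = {
    "o": "A-o-B",
    "e": "A-e-B",
    "al": "A-al-B",
    "ul": "A-al-B",
    "ʊl": "A-al-B",
}


def classify_compound(transcription: str) -> str:
    """Return the orthographic pattern of a transcribed two-stem compound."""
    # Drop surrounding whitespace and the phonemic slashes.
    text = transcription.strip().strip("/")
    # Stems and linkers are separated by hyphens or spaces in this sketch.
    parts = [p for p in re.split(r"[-\s]+", text) if p]
    if len(parts) == 3 and parts[1] in LINKERS:
        return LINKERS[parts[1]]
    if len(parts) == 2:
        return "AB"
    return "unknown"


for example in ("/aːs paːs/", "/ɡʰəruːr-o-təkəbʊr/",
                "/sədr-e-mʊmlɪkət/", "/ʊmhaːt-ʊl-moːmɪniːn/"):
    print(example, classify_compound(example))
```

The sketch only works on transcriptions that mark the linker explicitly. Running Urdu text, in which the zair is frequently left unwritten, would need a lexicon lookup instead.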

AB orthographic pattern

The analysis of the Urdu compound “/ðuːð pɑtiː/ [milk tea]” within the LSF reveals notable morphological and semantic parallels between the stems “/ðuːð/ [milk]” and “/pɑtiː/ [tea leaves].” Each constituent shows the same skeletal structure, [+M, -B, -CI([Ri])] (mass, not bounded, non-countable, with referential identity). The compound co-indexes arguments across both lexemes, situating them within the substance/thing semantic frame, which indicates a close association in function and meaning. From an LSF perspective, the equivalence in morphological and semantic structures between these two elements suggests that the compound reflects an attributive relationship rather than a compositional or purely lexical relation. The compound operates within a single conceptual domain, with both constituents contributing equally to the overall meaning. This supports the notion that “/ðuːð pɑtiː/” serves not just as a linguistic unit but also as a culturally embedded term, carrying encyclopedic knowledge about its colloquial usage as a common drink in Urdu-speaking regions. Similar findings appear in the analysis of attributive compounds in other languages, where non-compositional compounds (such as “birdbrain” in English) also reflect shared morphological and encyclopedic knowledge (Lieber, 2004; Bauer, 2003). However, while English exocentric compounds often diverge in meaning from their constituent lexemes, “/ðuːð pɑtiː/” maintains a high level of semantic transparency, demonstrating a more literal combination. This transparency aligns with Bauer et al.’s (2015) observations on endocentric compounds in typologically similar languages, where both elements contribute to a coherent and culturally significant meaning.
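
The skeletal comparison described above can be illustrated by parsing the bracketed feature notation and intersecting the two bundles. This is a minimal sketch under stated assumptions: the helper names `parse_skeleton` and `shared_features` are hypothetical, the notation is simplified to signed feature labels, and the co-indexed argument ([Ri]) is ignored rather than modeled.

```python
import re


def parse_skeleton(notation: str) -> dict:
    """Parse a bundle such as '[+M, -B, -CI([Ri])]' into {feature: bool}."""
    features = {}
    # Each feature is a sign followed by a label; the argument list
    # '([Ri])' carries no sign and is skipped by the pattern.
    for sign, name in re.findall(r"([+-])\s*([A-Za-z]+)", notation):
        features[name] = (sign == "+")
    return features


def shared_features(first: dict, second: dict) -> dict:
    """Return the features on which two skeletons carry the same value."""
    return {k: v for k, v in first.items() if second.get(k) == v}


# Feature values as given in the text for both stems of /ðuːð pɑtiː/.
milk = parse_skeleton("[+M, -B, -CI([Ri])]")
tea_leaves = parse_skeleton("[+M, -B, -CI([Ri])]")

print(shared_features(milk, tea_leaves))
# {'M': True, 'B': False, 'CI': False}
```

When the intersection equals either bundle, as it does here, the two stems have identical skeletons, which is the equivalence the analysis relies on.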

The compound /aːɡeː pɪtʃʰeː/ [back and forth] demonstrates a directional and reciprocal relationship. Each constituent (/aːɡeː/ [front] and /pɪtʃʰeː/ [back]) is syntactically aligned as a noun and adverb, and semantically characterized by [+D, +S, -IEPS([Ri])] features, which denote direction, frequency, and reciprocity. This structural and semantic coherence aligns both parts as expressing movement in opposite but interrelated directions, facilitating the interpretation of the compound as a “back and forth” motion. The LSF analysis reveals that this compound achieves its meaning through aligned directionality and reciprocal semantics, a structure also observed in similar compounds from other languages. For example, compounds like “zigzag” in English operate on similar principles, where both constituents contribute equally to express a pattern of movement with an inherent reciprocal relationship (Lieber, 2010). However, unlike endocentric structures such as English “wooden chair” (with a head-modifier structure), the compound /aːɡeː pɪtʃʰeː/ relies on symmetry and reciprocal semantics rather than a primary-subordinate hierarchy. Studies such as Bauer (2003) and Lieber (2010) emphasize that languages with compounding processes influenced by symmetry and reciprocity (like Urdu and Chinese) show a unique approach to combining nouns and modifiers. These compounds deviate from traditional endocentric or exocentric categories, fitting into a category of “symmetrical compounds,” where meaning emerges from mutual interaction rather than a dominant constituent. Through the LSF, this compound can be seen as a culturally embedded representation of movement and reciprocity, aligning with observations by Wang and Zhou (2018) on Chinese directional compounds that operate similarly.
Both languages use reciprocal alignment to express nuanced spatial and directional relations, illustrating that LSF can capture cultural semantics within lexical formations across diverse languages.

Analyzing the compound /kʰɑːnɑː peːnɑː/ [eat and drink] within the LSF reveals rich, layered semantic features. Each verb retains its primary domain feature, [+D ([Ri])], linked to consumption, yet differing in aspects of “liquid” and “sensory experience”. The verb /peːnɑː/ uniquely denotes [+liquid], while /kʰɑːnɑː/ lacks this specification. Despite these distinct roles, both converge within the domain of social and celebratory contexts, indicating a shared functional purpose in events involving feasting and celebration. Comparatively, studies on similar verb compounds in other languages demonstrate how cultural context shapes such semantic attributes. For example, Bauer (2003) highlights that compound words with consumption-related themes in English or German also show shared celebratory associations but lack certain unique features present in Urdu, such as the implied “sensory experience.” This aligns with Wierzbicka’s (2010) findings that cultural values embedded in language shape unique compound meanings, even when linguistic structures are similar across languages. This comparison emphasizes the importance of cultural context in understanding compounds within LSF, as the features are not merely structural but culturally enriched, suggesting the framework’s adaptability across languages with culture-specific nuances.

A-o-B orthographic pattern

The compound /mʊlkoː qoːm/ [country and nation] is analyzed through Lieber’s (2004) LSF by examining its morphosyntactic and semantic properties. In this N + N structure, each constituent possesses identity-related semantic features: [-M, -B, -CI([Ri])] for /mʊlkoː/ and [+M, +B, +CI([Ri])] for /qoːm/. Both share <+identity>, <+territorial>, and <+sovereignty> attributes, indicating their co-referential relationship. The two terms convey complementary aspects of identity, with /mʊlkoː/ as a geographical marker and /qoːm/ as a group unified by boundaries. This shared semantic structure mirrors findings in studies on similar nominal compounds in other languages, where geographic and group terms often combine to express an interdependent relationship (e.g., “land and people”). Such compounds in English or Persian typically rely on encyclopedic knowledge to provide contextual interpretability, a trait also observed in /mʊlkoː qoːm/ for Urdu speakers. Other studies, such as Lieber (2010) on English compounding, note that compounds often use a PART FOR WHOLE metonymy to blend geographical identity with collective human identity, a concept echoed here in the harmonious co-referential structure of [country and nation].

A-e-B orthographic pattern

Analyzing the phrase /bəɾəɦeː mɵɦəɾbaːɳɪ/ [kindly] through the LSF offers insights into its underlying structure and meaning. Here, the compound “bəɾəɦeː mɵɦəɾbaːɳɪ” combines the adverbial sense of “bəɾəɦeː” with “mɵɦəɾbaːɳɪ,” conveying a nuanced sense of kindness or polite request. According to LSF, each component in a compound contributes to the overall meaning based on its lexical features and skeletal schemata, including dimensions such as orientation and consistency. In this analysis, both stems (i.e., “bəɾəɦeː” and “mɵɦəɾbaːɳɪ”) exhibit a shared orientation towards politeness, with “bəɾəɦeː” introducing the modality and “mɵɦəɾbaːɳɪ” expressing the nature of the action requested. This shared schema aligns with parallel structures found in other languages, where adverbs or adjectives modify nouns to create contextually appropriate phrases. Lieber (2010) and recent studies support that such endocentric compounds (where meaning is directed internally) rely on established lexical patterns that invoke cultural or social expectations of kindness or politeness. Comparatively, studies on English politeness markers (e.g., “please” in “please wait”) demonstrate similar patterns, where lexical components work cohesively to convey the desired intent without needing explicit explanation, as users rely on cultural and contextual cues (Sirintranon and Hsieh, 2024). This analysis reinforces that LSF can effectively unpack culturally embedded expressions by identifying skeletal and semantic parallels across languages. However, while English politeness strategies may use individual words, Urdu often composes compounds to reinforce context, highlighting LSF’s flexibility in capturing language-specific compounding techniques.

Using the LSF, we interpret the compound /d͡ʒuzvɵ vəqtiː/ [part-time] by examining how its constituent stems interact semantically to create a unified meaning. In this structure, /d͡ʒuzvɵ/ represents “part,” while /vəqtiː/ signifies “time.” Both stems share a relational index [Ri] and convey distinct features that complement each other. The first stem, /d͡ʒuzvɵ/, suggests a segmented aspect (“one segment of the whole”), indicating partiality, while /vəqtiː/ references a temporal dimension (“reflects time indication”). Together, they imply a part-time concept, where time is divided or segmented. This compound aligns with Lieber’s (2010) lexical-semantic principles, particularly under the theme of partiality and duration, as both stems operate in a cohesive, non-hierarchical relationship to produce a specific temporal function. Unlike endocentric compounds, where the head directs meaning, /d͡ʒuzvɵ vəqtiː/ exemplifies an attributive structure in which the two stems contribute equally to define the compound’s concept. Comparable compounds in other languages, such as English “part-time,” show similar semantic attributes, reflecting the universality of the attributive structure in expressing segmented temporal meanings. Bauer’s (2003) observations about compound flexibility in integrating cultural interpretations also apply here, as the Urdu compound can convey a cultural notion of segmented labor or time usage in a way that complements Urdu speakers’ contextual understanding.

A-al-B orthographic pattern

To interpret the compound /ðarb alʔʔamθaːl/ through Lieber’s LSF, we can examine its syntactic and semantic components to reveal its exocentric nature. The compound consists of ðarb (proverb, noun) and alʔʔamθaːl (examples, noun), where each stem possesses a distinct semantic schema but converges on the artifact and function attributes. The compound’s meaning, “proverb” or “common saying,” does not directly align with either of its parts alone but instead reflects an overarching concept formed through cultural knowledge and association. The exocentric nature is evident as the meaning transcends the literal interpretations of its components, embodying a broader, culturally established role. In her work, Lieber (2009) addresses metonymy in compounds but does not fully account for cases like ðarb alʔʔamθaːl, where semantic connections rely on extensive cultural knowledge rather than a straightforward metonymic link. According to Lieber’s framework, exocentric compounds often exhibit a reliance on encyclopedic knowledge, allowing compounds to convey culturally specific meanings. This aligns with studies like Bauer (2003), which highlight that exocentric compounds rely heavily on external associations and cultural context, rather than internal linguistic cues alone. In comparison to endocentric compounds like bamboo bed (where the head clearly defines the compound’s meaning), exocentric compounds such as ðarb alʔʔamθaːl demand interpretation based on convention and established cultural knowledge, reinforcing the view that meaning is constructed through both linguistic and extra-linguistic factors. This perspective also aligns with findings by Geeraerts (2015), who notes that exocentric compounds often serve as idiomatic expressions, deriving meaning from societal conventions rather than purely syntactic structures.

AB orthographic manner

The analysis of /dʒeːb χəɾtʃ/ [pocket money] shows a significant overlap in the semantic properties of its two stems. Both share features such as <+container> and <+function>, emphasizing the interrelationship and coherence in meaning. However, the differences lie in the encyclopedic semantics of each stem. /dʒeːb/ is inherently relational, focusing on the spatial position of something (the pocket), while /χəɾtʃ/ grounds the relational position in the functional role of money being spent. This highlights how encyclopedic knowledge adds depth to the compound without altering its core structure. From a functional perspective, /dʒeːb/ behaves as a two-place predicate that involves a possessor (the person who owns the pocket) and an external argument, which is /χəɾtʃ/. The possessor argument in /dʒeːb/ is linked to the primary argument in /χəɾtʃ/, creating a unified conceptual structure around money management or personal expenditure. The compound’s meaning highlights the relationship between the physical space (pocket) and the action (spending money), thereby creating a functional relationship between space and action. These findings align with Lieber’s (2004) analysis of compounds, which highlights the interplay between the functional and semantic roles of compound elements. Bauer (2003) similarly discusses how compounds like /dʒeːb χəɾtʃ/ integrate distinct conceptual contributions from both parts to produce a holistic meaning. Unlike English compounds such as buyer-seller (Plag, 2006), which maintain symmetrical coordination, Urdu compounds like /dʒeːb χəɾtʃ/ depend more heavily on cultural and syntactic conventions. The compound’s structure is influenced by hierarchical relationships between its constituents, where the first noun (pocket) introduces a space that the second noun (expenditure) occupies. This structure is less about coordination and more about the functional relationship between the two elements.

A-e-B orthographic manner

The analysis of /waqtɵ rukhsat/ [time of departure] shows a significant overlap in the semantic properties of its two stems. Both share features such as <+relative> and <+duration>, emphasizing the interrelationship and coherence in meaning. However, the differences lie in the encyclopedic semantics of each stem. /waqtɵ/ is inherently abstract, focusing on the concept of time, while /rukhsat/ grounds this in the functional role of departure, emphasizing the irreversible and permanent nature of leaving. This distinction highlights how encyclopedic knowledge adds depth to the compound without altering its core structure. From a functional perspective, /waqtɵ/ behaves as a one-place predicate that introduces the point of time, while /rukhsat/ functions as a two-place predicate, involving both the action (departure) and the time in which it occurs. The primary argument in /waqtɵ/ (the point of time) co-indexes with the primary argument in /rukhsat/ (the event of leaving), creating a unified conceptual structure around the time of departure. The temporal aspect of the compound is thus inherently linked to the action of departure, creating a functional connection between the two elements. These findings align with Lieber’s (2004) analysis of compounds, which emphasizes the interplay between the functional and semantic roles of compound elements. Similar studies, like those by Bauer (2003), discuss how compounds like /waqtɵ rukhsat/ integrate distinct conceptual contributions from both parts to produce a holistic meaning. Unlike English compounds such as sundown (Plag, 2006), which maintain a symmetrical coordination, Urdu compounds like /waqtɵ rukhsat/ rely more on the hierarchical relationship between their constituents. The first noun (time) introduces the temporal framework, and the second noun (departure) anchors this framework in a specific event, highlighting the functional relationship between time and action.

A-al-B orthographic manner

The compound /bæt aːlmʊkədʌs/ (holy house) reflects a coordinated relationship between its two stems, /bæt/ and /aːlmʊkədʌs/. The analysis shows a significant overlap in the semantic properties of the two stems. Both share features such as <+orientation>, emphasizing the relationship between space (house) and the divine quality (sacredness). However, the differences lie in the functional properties of each stem. /bæt/ is a noun denoting a place or shelter, with the features <+shelter> and <+habitability>, which suggests that the compound refers to a space designed for living. On the other hand, /aːlmʊkədʌs/ is an adjective that introduces the feature of divinity, marked as <+sacredness> and <+divinity>, which grounds the notion of the house in its spiritual or religious context. From a functional perspective, /bæt/ behaves as a one-place predicate, identifying the house as a specific place of habitation, while /aːlmʊkədʌs/ functions as a two-place predicate, with its primary argument co-indexed with the primary argument of /bæt/. In this compound, the primary argument ‘R’ in /aːlmʊkədʌs/ co-indexes with the primary argument ‘R’ in /bæt/, creating a unified conceptual structure around the idea of a house (bæt) that is imbued with holiness (aːlmʊkədʌs). This mirrors the findings of Bauer (2008), who discusses how compounds in languages like Urdu reflect a functional relationship where an adjective can modify or qualify the meaning of the noun. Unlike compounds in English, such as Red-hot (Plag, 2003), where both elements are of the same category (adjective + adjective), compounds in Urdu often reflect a hierarchical relationship between the noun and the adjective. Here, bæt (house) acts as the primary anchor for the concept, with aːlmʊkədʌs providing an additional qualitative attribute, creating a sacred space. 
Further, the role of co-indexation between the two stems is significant as it highlights a semantic cohesion between the two elements, as seen in other noun-adjective compounds discussed by Lieber (2004), where the adjective modifies the noun’s inherent features but does not introduce a completely new set of arguments.

Attributive compounds

Within the LSF presented here, attributive compounds do not subscribe to either argumental or equivalence interpretations. They function as a comprehensive “everything else” category, essentially constituting the default type of compound within this framework. Notably, the semantic relationship between the first and second elements within attributive compounds exhibits remarkable flexibility, readily accommodating a diverse range of pragmatically inferred interpretations (Bauer, 2023). Examples 21-30 illustrate this formation in the Urdu language.

AB orthographic manner

The compound /χʊʃ mɪʐaːd͡ʒ/ [good-natured] reflects a coordinated relationship between its two stems, /χʊʃ/ and /mɪʐaːd͡ʒ/. While both elements suggest positivity, /χʊʃ/ is primarily associated with cheerfulness and consistency in an emotional state, whereas /mɪʐaːd͡ʒ/ suggests a broader hopefulness or mood orientation, often with a deeper or more enduring aspect of temperament. In comparison to other studies on compounds, such as Bauer (2003) and Plag (2006), this compound showcases the coalescing of two adjectives to form a unified concept. According to Plag (2003), English compounds often display coordination between two elements where both contribute equally to the overall meaning. In contrast, Urdu compounds like /χʊʃ mɪʐaːd͡ʒ/ often retain a functional hierarchy: the first element /χʊʃ/ (cheerfulness) provides the primary trait, while the second element /mɪʐaːd͡ʒ/ (mood) refines or deepens the conceptualization. Lieber (2004) also highlights that in compounds like this, the two stems can both contribute to the overall meaning but do so in a way that involves a subtle semantic layering. The first adjective might introduce a more general concept (cheerfulness), while the second one (mood) adds a specific emotional orientation, which is typical in many Urdu adjective-adjective compounds. Further, the relationship between the two stems involves both shared semantic features (abstract, emotional) and subtle distinctions (consistency vs. orientation), reflecting the cultural and functional depth of the compound. This differs from English compounds like good-natured, which maintain a simpler, more direct coordination between the elements.

A-o-B orthographic pattern

The compound /raħm əkram/ (blessing and warmth) reflects a coordinated relationship between its two stems, /raħm/ and /əkram/. The analysis shows a significant overlap in the semantic properties of the two stems. Both share features such as <+abstract> and <+orientation>, which indicate that the compound expresses a general abstract positive sentiment and a directional force, namely, the orientation of goodness or support. In line with studies by Bauer (2003), compounds like /raħm əkram/ display a blending of semantic properties, wherein the meanings of the individual stems cannot be directly derived from the sum of their parts but are shaped by cultural and conceptual factors. Both /raħm/ and /əkram/ point to non-concrete, abstract concepts that are culturally and socially recognized as virtues. This form of semantic cohesion contrasts with more literal compounds, such as those found in some Germanic languages, where the meaning often remains closely tied to the literal meanings of the components (e.g., bookcase in English). Similarly, compounds in Urdu like /raħm əkram/ show a higher degree of semantic transparency, in the sense that both stems are inherently linked to human qualities and virtues, which may be understood with little need for contextual or encyclopedic knowledge. However, as Plag (2006) notes, Urdu compounds also exhibit functional interdependencies that may not always be immediately clear in other languages, especially those with more rigid syntactic structures like English.

A-e-B orthographic pattern

The compound /ɾɑɦɵ ɾaːʂʈ/ [right path] exemplifies an Adj + N structure, where the adjective /ɾɑɦɵ/ (right) modifies the noun /ɾaːʂʈ/ (path), establishing an attributive relationship. This relationship reflects societal norms and moral correctness, with “right” contributing abstract properties like dimension and consistency, while “path” emphasizes functionality and alignment with conventions. The co-indexing of [Ri] between constituents suggests a shared conceptual domain, interpreting the compound as a metaphorical “correct way.” This compound’s semantic transparency aligns with Lieber’s (2010) observations on compositional clarity. However, its abstract interpretation supports Bauer’s (2003) argument about the role of cultural contexts in shaping compound meanings, especially in non-Western languages. Similar to the findings of Scalise et al. (2009), the compound integrates a cultural layer, conveying broader societal values. Modifier-driven semantic shifts, as noted by Scalise et al. (2005), are evident as /ɾɑɦɵ/ (right) amplifies the head’s connotative meaning. Unlike exocentric compounds such as /bɜːd-breɪn/ (birdbrain), which rely on metonymy, /ɾɑɦɵ ɾaːʂʈ/ maintains a straightforward attributive relationship, highlighting its focus on shared dimensions rather than figurative interpretations. This analysis underscores the compound’s alignment with cultural and linguistic norms while differentiating it from exocentric structures described in other studies.

A-al-B orthographic pattern

The compound /ˈaʃraf alˈmukhluːqaːt/ (honorable creature) reflects a coordinated relationship between its two stems, /ˈaʃraf/ and /alˈmukhluːqaːt/. The analysis shows a significant overlap in the semantic properties of the two stems. Both share features such as <+animate> and <+consistency>, indicating that the compound refers to living beings with a consistent, stable nature. The differences between the two stems lie in their conceptual contributions. /ˈaʃraf/ is an adjective that emphasizes prestige and honor, while /alˈmukhluːqaːt/ is a noun that emphasizes creation by Allah and the inherent divine origin of these creatures. Thus, the two stems combine to highlight the moral and divine status of the creatures, creating a compound that reflects both a humanly acknowledged honor and a divine origin. From a functional perspective, /ˈaʃraf/ behaves as an adjective, providing a qualitative aspect to the noun, while /alˈmukhluːqaːt/ serves as a noun denoting the class of creatures. The adjective modifies the noun, emphasizing the prestige of Allah’s creation. Both elements are closely related in meaning but differ in their contribution to the overall compound. The adjective modifies the noun in a way that aligns with Lieber’s (2004) analysis of compound structures, where one element functions as an attributive modifier of the other, producing a unified conceptual structure. These findings align with Lieber’s (2004) discussion of adjective-noun compounds, where the adjective contributes a qualitative dimension to the noun, often expressing a moral or evaluative stance (e.g., honorable). This structure is also seen in studies on compounding in other languages, such as Bauer (2003), who emphasizes how compounds like /ˈaʃraf alˈmukhluːqaːt/ rely on the hierarchical relationship between adjective and noun. 
In contrast to English compounds like high school (Bauer, 2003), which tend to maintain a more neutral or descriptive relationship between their components, compounds like /ˈaʃraf alˈmukhluːqaːt/ are deeply embedded in cultural and religious contexts. The first element (honorable) adds a prestigious or moral value to the second element (creature), which already carries a specific religious and divine connotation.

A deeper understanding of Urdu compounding must also consider sociolinguistic factors that shape lexical creativity. Elements such as bilingualism, code-switching, regional dialectal variation, and cultural-religious expressions significantly influence compound formation in both formal and informal Urdu. For instance, compounds like /ʊmhaːt-ʊl-moːmɪniːn/ reflect religious and Arabic influences, while others like /dʒeːb χəɾtʃ/ are driven by everyday socioeconomic needs. These patterns suggest that compounding functions as both a linguistic and a sociocultural phenomenon, shaped by factors such as education, media, and social class. To enhance NLP applications such as machine translation and semantic search in Urdu, toolkits must account for regional and functional variants of compounds, especially their varying levels of semantic transparency. Integrating sociolinguistic corpora, developing compound-aware tokenizers, and training models on culturally embedded word-formation patterns would be vital steps for advancing Urdu computational linguistics.
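
One way to picture the compound-aware tokenizer mentioned above is a post-processing pass that merges adjacent tokens found in a compound lexicon. The following is a minimal sketch, not an existing toolkit: the function name `merge_compounds`, the tiny lexicon, and the placeholder tokens `w1` and `w2` are all hypothetical, and the lexicon is limited to two-stem AB compounds taken from this article.

```python
def merge_compounds(tokens, compound_lexicon):
    """Merge adjacent token pairs that are listed as compounds."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in compound_lexicon:
            # Keep the compound as one unit so that later stages
            # (translation, search) see its lexicalized meaning.
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical lexicon built from two AB compounds discussed in the article.
LEXICON = {("dʒeːb", "χəɾtʃ"), ("aːs", "paːs")}

# "w1" and "w2" stand in for arbitrary surrounding words.
print(merge_compounds(["w1", "dʒeːb", "χəɾtʃ", "w2"], LEXICON))
# ['w1', 'dʒeːb χəɾtʃ', 'w2']
```

A lexicon lookup of this kind handles only listed, fully lexicalized compounds. Productive or regionally variable compounds, which the paragraph above highlights, would need statistical or model-based segmentation.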

Conclusion

This study has employed the LSF to investigate compounding, a key word-formation process in the Urdu language. By analyzing various compound types, such as noun-noun, noun-adjective, noun-verb, and verb-verb combinations, the study has highlighted several important findings. The research reveals both semantic transparency and opacity within compound structures, demonstrating a spectrum of relationships where constituent meanings directly contribute to the whole (transparency) and where new, emergent meanings arise (opacity). Analyzing these relationships through the LSF provides a comprehensive understanding of how meaning is constructed in compounds. Additionally, the study identifies the morphological and syntactic constraints governing compound formation in Urdu, highlighting specific patterns and limitations in constituent selection and combination. The study also demonstrates the productivity and lexicalization of different compound types, offering insights into how Urdu utilizes compounding as a strategy to enrich its vocabulary. This further emphasizes the language’s expressive capacity and its ability to generate new words. The cross-linguistic comparison enabled by the LSF provides a broader perspective on compounding patterns and semantic relationships, uncovering both unique features of Urdu compounds and universal aspects of word formation across languages. In conclusion, applying the LSF to the study of Urdu compounding has proven to be a valuable approach, shedding light on the internal mechanisms of word formation while offering broader insights into the language’s semantic richness and flexibility. The findings of this study enhance our understanding of Urdu and offer a deeper perspective on word formation across languages. Future research could explore several key areas to deepen our understanding of Urdu compounding.
One direction could be investigating the role of compounds in the development of neologisms, particularly in digital and media contexts. Another potential area of exploration is examining the influence of sociolinguistic factors, such as code-switching and language contact, on compounding processes, especially in bilingual or multilingual settings. Additionally, expanding cross-linguistic studies by comparing compounding in Urdu with other South Asian languages, such as Hindi or Punjabi, could reveal regional patterns and shared linguistic mechanisms in word formation. These areas of study would deepen our understanding of compounding in Urdu and contribute to the broader field of morphology and lexical semantics, ultimately advancing our knowledge of language structure and evolution.

Limitations of the study

This study provides valuable insights into the compounding process in the Urdu language using Lieber’s LSF, yet several limitations merit consideration. The purposive sample of 30 compounds, while allowing for detailed qualitative analysis, limits the generalizability of findings to the broader Urdu lexicon. Moreover, the exclusive reliance on The Express newspaper and Feroz-ul-Lughat dictionary as data sources narrows the scope, potentially overlooking the diversity of compounding in other registers, dialects, or informal contexts. The analysis may also reflect biases inherent in the selected framework and sources, which could obscure alternative semantic interpretations or regional variations within Urdu-speaking communities. Although the LSF offers a robust analytical tool, its application to a typologically distinct language like Urdu may require further adaptation to address culturally specific or non-structural aspects of compounding that were not fully explored. Additionally, the absence of quantitative methods limits the ability to measure the frequency, productivity, or statistical significance of the observed patterns. While the study focuses on major compounding types, it does not account for less common or emerging forms influenced by contemporary linguistic phenomena such as code-switching or loanwords. Addressing these limitations in future research would enhance the comprehensiveness and applicability of the findings, contributing to a deeper understanding of Urdu’s morphological richness and its implications for computational and theoretical linguistics.

Author contributions

Tahir Saleem structured the manuscript, critically developed the debate within the existing study, and created the analytical tension, while also proofreading, polishing, and editing the final version. Ms. Shumaila contributed by preparing the initial draft and conducting the data analysis.

Data availability

The relevant data for the current study is made available as supplementary material, ensuring transparency and accessibility for further research.

Competing interests

The authors declare no competing interests.

AI disclosure

This manuscript has been supported by the use of ChatGPT 4 for various aspects of preparation, including language polishing, proofreading, and editing.

Supplementary information

The online version contains supplementary material available at https://doi.org/10.1057/s41599-025-04982-x.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Ambreen S, To CKS (2024) Review of the phonological system of contemporary Urdu spoken in Pakistan. Int J Speech-Lang Pathol, 1–12. https://doi.org/10.1044/2021_JSLHR-21-0014

Andreou M (2017) The lexical semantic framework for morphology. Oxf Res Encycl Linguist. https://doi.org/10.1093/acrefore/9780199384655.013.255

Arcodia GF, Basciano B (2018) The construction morphology analysis of Chinese word formation. In: Booij G (ed) The construction of words. Studies in morphology, vol 4. Springer, Cham, pp 219–253. https://doi.org/10.1007/978-3-319-74394-3_9

Aziz A, Saleem T, Maqsood B, Ameen Z (2020) Grammatical and syntactical functions of auxiliaries in English and Urdu. Amazon Investig 9(35):34–50. https://dx.doi.org/10.34069/AI/2020.35.11.3

Batool R, Saleem T (2023) Comparative construction morphology of diminutive forms in English and Urdu. Cogent Arts Humanit 10(1):12–38. https://dx.doi.org/10.1080/23311983.2023.2238998

Bauer L (2003) The productivity of (non-) productive morphology. Ital J Linguist 15:7–16. https://www.italian-journal-linguistics.com/app/uploads/2021/06/02.Bauer_.pdf

Bauer L, Renouf A (2001) A corpus-based study of compounding in English. J Engl Linguist 29(2):101–123. https://dx.doi.org/10.1177/00754240122005251

Bauer L (2008) Exocentric compounds. Morphology 18:51–74. https://doi.org/10.1007/s11525-008-9122-5

Bauer L (2023) The birth and death of affixes and other morphological processes in English derivation. Languages 8(4):244. https://doi.org/10.3390/languages8040244

Bauer L, Körtvélyessy L, Štekauer P (2013) Semantics of complex words (Vol. 3). Springer. https://doi.org/10.1007/978-3-319-14102-2_1

Bauer L, Lieber R, Plag I (2015) The Oxford reference guide to English morphology. Oxford University Press. https://global.oup.com/academic/product/the-oxford-reference-guide-to-english-morphology-9780199579266?

Booij G (2010) Construction morphology. Lang linguist compass 4(7):543–555. https://doi.org/10.1093/acrefore/9780199384655.013.254

Butt M, King TH (2007) Urdu in a parallel grammar development environment. Lang Resour Eval 41:191–207. https://dx.doi.org/10.1007/s10579-007-9042-8

Dar M, Anwaar H, Vihman M, Keren-Portnoy T (2015) Developing an Urdu CDI for early language acquisition. Y Pap Linguist 14(2):1–14. https://dx.doi.org/10.1080/02699206.2017.1308553

Daud A, Khan W, Che D (2017) Urdu language processing: a survey. Artif Intell Rev 47:279–311. https://dx.doi.org/10.1007/s10462-016-9482-x

Dogar MF, Saleem T, Aslam M, Khan SY (2024) Exploring global linguistic nuances: analyzing region-specific inflectional morpheme frequency in ICNALE. Asian-Pac J Second Foreign Lang Educ 9(1):65. https://dx.doi.org/10.1186/s40862-024-00291-z

Fabb N (2017) Compounding. In: Andrew S, Arnold MZ (eds) The handbook of morphology. John Wiley & Sons, pp 66–83. https://doi.org/10.1002/9781405166348.ch3

Fauconnier G, Turner M (2002) Conceptual blending, form and meaning. Rech en commun 19:57–86. https://doi.org/10.14428/rec.v19i19.48413

Gagné CL, Spalding TL (2016) Effects of morphology and semantic transparency on typing latencies in English compound and pseudocompound words. J Exp Psycho: Lear Mem Cognit 42(9):1489. https://doi.org/10.1037/xlm0000258

Gawron JM (2016) Lexical representations and the semantics of complementation. Routledge. https://doi.org/10.4324/9781315527338

Geeraerts D (2015) Lexical semantics. In: Ewa D, Dagmar D (eds) Handbook of cognitive linguistics, 273–295. https://doi.org/10.1515/9783110292022

Goddard C (2001) Lexico-semantic universals: a critical overview. Linguist Typol 5(1):1–65. https://dx.doi.org/10.1515/lity.5.1.1

Halle M, Marantz A (1993) Distributed morphology and the pieces of inflection. In Hale K, Keyser SJ (Eds), The view from building 20 (pp. 111–176). The MIT Press

Hamawand Z (2024) The status of interjections in Cognitive Grammar. Cogn Linguist Stud 11(2):274–295. https://dx.doi.org/10.1075/cogls.00122.ham

Haroon S, Aslam M, Saleem T (2023) Exploring the cross-linguistic functioning of the Principles of WH-Movements: the case of Pakistani ESL learners. Cogent Arts Humanit 10(1):21–48. https://dx.doi.org/10.1080/23311983.2023.2174518

Irwin M (2016) The morphology of English loanwords. Handbook of Japanese lexicon and word formation, pp 161–197. https://doi.org/10.1515/9781614512097-009

Khan A, Saleem T, Khan AA, Azam S (2025) Syntax and morphology of Baniswola Pashto: investigating universal and dialectal variations. Cogent Arts Humanit 12(1):24–43. https://dx.doi.org/10.1080/23311983.2024.2448073

King RD (2008) Language politics and conflicts in South Asia. In: Braj BK, Yamuna K, Sridhar SN (eds) Language in South Asia. Cambridge University Press, Cambridge

Kiparsky P (2021) Phonology to the rescue: Nez Perce morphology revisited. Linguist Rev 38(3):391–442. https://dx.doi.org/10.1515/tlr-2021-2071

Körtvélyessy L (2020) Onomatopoeia–A unique species? Studia Linguistica 74(2):506–551. https://doi.org/10.1111/stul.12133

Lakoff G, Johnson M (1980) The metaphorical structure of the human conceptual system. Cogn Sci 4(2):195–208. https://dx.doi.org/10.1207/s15516709cog0402_4

Libben G, Gallant J, Dressler WU (2021) Textual effects in compound processing: a window on words in the world. Front Commun 6:646454. https://dx.doi.org/10.3389/fcomm.2021.646454

Lieber R (2007) The category of roots and the roots of categories: What we learn from selection in derivation. Morphology 16(2):247–272. https://dx.doi.org/10.1007/s11525-006-9106-2

Lieber R (1992) Deconstructing morphology: Word formation in syntactic theory. University of Chicago Press. https://press.uchicago.edu/ucp/books/book/chicago/D/bo3618073.html

Lieber R (2004) Morphology and lexical semantics. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511486296

Lieber R (2009) A lexical semantic approach to compounding. In: Lieber R, Štekauer P (eds) The Oxford handbook of compounding. Oxford University Press, pp 78–104. https://doi.org/10.1093/oxfordhb/9780199695720.013.0005

Lieber R (2010) On the lexical semantics of compounds: non-affixal (de)verbal compounds. In: Scalise S, Vogel I (eds) Cross-disciplinary issues in compounding. John Benjamins, Amsterdam/Philadelphia, pp 127–144. https://doi.org/10.1075/cilt.311.11lie

Lieber R (2021) Introducing morphology. Cambridge University Press. https://www.cambridge.org/highereducation/books/introducingmorphology/301B4B3CEB4045756E2112603B719426#overview

Lieber R, Baayen H (1999) Nominalizations in a calculus of lexical semantic representations. In: Booij G, Marle J van (eds) Yearbook of morphology 1998. Springer, Dordrecht, pp. 175–198. https://doi.org/10.1093/acrefore/9780199384655.013.255

Maqsood B, Aziz A, Saleem T, Summiya Azam TS (2018) A comparative study of WH-movement in Urdu and English: a minimalist perspective. Int J Engl Linguist 8(6):203–251. https://dx.doi.org/10.5539/ijel.v8n6p203

Maqsood B, Aziz A, Summiya Azam TS (2018) A Comparative Study of WH-Movement in Urdu and English: A Minimalist Perspective. Int J Eng Linguist 8(6):203–251. https://doi.org/10.5539/ijel.v8n6p203

Meteyard L, Vigliocco G (2018) Lexico-semantics. The Oxford handbook of psycholinguistics, pp 71–90. https://doi.org/10.1093/oxfordhb/9780198786825.001.0001

Nabi FG, Sultan Ali Khan B, Saleem T (2025) The structure of inflected Urdu nominals: insights from Distributed Morphology. Cogent Arts Humanit 12(1):24–43. https://dx.doi.org/10.1080/23311983.2025.2464383

Noreen A, Muneer I, Nawab RMA (2024) Mono-lingual text reuse detection for the Urdu language at lexical level. Eng Appl Artif Intell 136:109003. https://dx.doi.org/10.1016/j.engappai.2024.109003

Plag I (2006) The variability of compound stress in English: structural, semantic, and analogical factors. Engl Lang Linguist 10(1):143–172. https://dx.doi.org/10.1017/S1360674306001821

Plag I (2003) The variability of compound stress in English: structural, semantic, and analogical factors. Eng Lang Linguist 10(1):143–172. https://doi.org/10.1017/S1360674306001821

Pustejovsky J, Bouillon P (1995) Aspectual coercion and logical polysemy. J Semant 12(2):133–162. https://dx.doi.org/10.1093/jos/12.2.133

Rahman T (2006) Language policy, multilingualism and language vitality in Pakistan. Trends in linguist stud monogr 175(1):73–98. https://doi.org/10.1515/9783110197785

Saeed JI (2015) Semantics (Vol. 25). John Wiley & Sons. https://download.ebookshelf.de/download/0003/7280/11/L-G-0003728011-0007589664.pdf

Scalise S, Bisetto A, Guevara E (2005) Selection in compounding and derivation. Morphol Demarcations, pp 133–150. https://dx.doi.org/10.1075/cilt.264.09sca

Scalise S (2005) Italian compounds. Int J Latin Roman Linguist 24(1):61–91. https://doi.org/10.1515/probus-2012-0004

Scalise S, Magni E, Bisetto A (eds) (2009) Universals of language today. Springer, Heidelberg

Schank RC (2019) Inference in the conceptual dependency paradigm: a personal history. Lang, mind, brain 7(8):103–128. https://dx.doi.org/10.4324/9781315792286-9 https://www.taylorfrancis.com/chapters/edit/10.4324/9781315792286-9/inference-conceptual-dependency-paradigm-personal-history-roger-schank

Schmidt K, Funke N (2025) Exploration of the mandative subjunctive in Pakistani English. World Englishes 44(1-2):202–217. https://dx.doi.org/10.1111/weng.12697

Schmidt RL (1999) Urdu: An essential grammar. Routledge

Sharif AN (2022) Data sharpening and linguistic theorizing: a case study of the causative derivation of Urdu change-of-state verbs. J South Asian Lang Linguist 9(1-2):29–67. https://dx.doi.org/10.1515/jsall-2023-1003

Sirintranon N, Hsieh FF (2024) How word stress is realized in Thai: evidence from the ordering of coordinate compounds. Linguist Rev 41(2):225–246. https://dx.doi.org/10.1515/tlr-2024-2008

Wang W, Zhou W (2018) A contrastive study of English and Chinese synthetic compounds. Can Soc Sci 14(5):17–26. https://api.semanticscholar.org/CorpusID:125606471

Wechsler S (2020) The role of the Lexicon in the Syntax–semantics interface. Annu Rev Linguist 6(1):67–87. https://dx.doi.org/10.1146/annurev-linguistics-011619-030349

Wierzbicka A (2010) Lexical universals of kinship and social cognition. Behav Brain Sci 33(5):403. https://doi.org/10.1017/S0140525X10001433
