Multimodal insights: enhancing cultural promotion through analysis of Saudi Arabian audiovisual productions
Introduction
Motivated by the idea that cognitive and pragmatic frameworks in communication and discourse processing hold significant promise for Audiovisual Translation (AVT) research, Braun (2016) examined these models and highlighted their impact on understanding the three interconnected subprocesses inherent in all forms of AVT: the translator’s comprehension of multimodal discourse, the translation of specific elements within this discourse, and the target audience’s comprehension of the newly translated multimodal discourse.
Multimodality has recently become a persistent area of study in AVT research. The concept recognizes that communication is often complex and that people use multiple semiotic resources to make sense of the world and interact with others. Multimodal texts, which use more than one mode of communication to convey their message, are increasingly common in the digital age (Van Leeuwen, 2015), where technology allows for the easy integration of various modes. Modes of communication can include linguistic elements (spoken or written language), visual elements (images, graphics, colors), audio elements (music, sound effects), gestural elements (body language, facial expressions), spatial elements (layout, spatial organization), and others such as olfactory and tactile cues. In a multimodal text, these different modes work together to create meaning. For example, a television advertisement might combine moving images, spoken words, music, and text to persuade viewers to buy a product. Similarly, a webpage might use text, images, hyperlinks, and video clips to provide information and engage users.
Multimodal texts pose translation challenges due to the interplay of diverse communication modes, such as linguistic, visual, audio, gestural, and spatial elements. In AVT, the intricate relationships between these modes demand synchronization in dubbing and subtitling. In contrast, translating static images requires careful consideration of visual-textual synergy and cultural influences (Torresi, 2008). The presence and interaction of these modes redefine the translation process, requiring translators to comprehend both the linguistic content and its interplay with non-verbal elements, and emphasizing the need for a nuanced understanding of both source and target cultures (Tuominen et al., 2018). Overall, the multimodal nature of texts adds complexity, requiring translators to navigate intermodal dynamics for effective communication.
The pervasive presence of audiovisual media has significantly shaped cultural representation and promotion in our interconnected global landscape. In recent years, Saudi Arabia, recognizing the transformative power of multimedia communication, has actively embraced this trend. This is particularly evident in Vision 2030 and its associated economic objectives, which have fueled a concerted effort to showcase the richness of Saudi culture globally. The production of diverse audiovisual materials has been a strategic response, serving as a dynamic tool for raising awareness about Saudi cultural traditions and values internationally.
A noteworthy example of such audiovisual content is a short video crafted by the Ministry of Culture centered around “The Year of the Saudi Coffee.” This production strategically focuses on introducing Arabic coffee as a symbolic representation of generosity and social traditions deeply ingrained in Saudi society. Each step of serving coffee is meticulously portrayed, shedding light on its nuanced implications within the social setting. The video serves as a microcosm of broader cultural promotion efforts, aiming to showcase traditions and stimulate interest and understanding among a global audience.
This study presents a descriptive analysis that investigates the use of Dicerto’s (2018) analytical framework in subtitling audiovisual content that promotes Saudi culture. Employing a multimodal approach, the study aims to optimize the impact of such materials on international audiences by proposing translation strategies, focusing on Saudi Arabia as the research context. Thus, the data collected for this study constitutes a representative sample specific to this context.
In this study, the researcher introduces a qualitative analysis approach to multimodal texts, beginning with an overarching framework that establishes the general dimensions of analysis before delving into more specific components of a source text (ST) and their interactions. Drawing from a pragmatic perspective (Dicerto, 2018), the primary aim of this research is to understand the sender’s intention, as it is fundamental to decoding multimodal messages. Central to this investigation is the distinction between explicit and implicit meaning within the context of Relevance Theory. Additionally, the paper explores the visual-verbal relationships that contribute to multimodal message formation and examines the individual meanings conveyed by various modes.
Multimodality and translation challenges
Translating multimodal texts presents numerous challenges, as the process is inherently complex given the diverse semiotic modes interacting to convey meaning. Two established approaches in multimodal translation research, systemic functional analysis and corpus-based methods, both have limitations. Systemic functional analysis provides systematic methods for examining multimodal content, yet it encounters difficulties when applied to multimodal translation research (Tuominen et al., 2018). One issue is that metafunctional analyses of distinct modes, when conducted independently, may not integrate seamlessly, resulting in isolated monomodal analyses instead of a fully integrated multimodal analysis. Furthermore, systemic functional analyses yield intricate descriptions of components in multimodal texts without generating overarching conclusions about multimodal meaning construction, a problem termed the “infinite detail” predicament (Martinec and Salway, 2005, p. 341).
Likewise, Tuominen et al. (2018) note that corpus-based approaches in multimodal translation research face difficulties. Despite the longstanding application of corpus analysis in Translation Studies to various facets of monomodal texts, its extension to multimodal texts introduces heightened complexity. Technological constraints impede the seamless alignment of source texts with target texts or of different modes within the text. Software tools can only semi-automatically associate visual and audio elements with translated modes, and they encounter challenges in segmenting the text into analyzable units and providing subsequent semantic annotation.
In her work, El-Farahaty (2018) addressed the difficulties encountered in translating Arabic political webcomics, focusing on cultural and intertextual aspects. Employing a multimodal framework to comprehend visual and verbal modes within their socio-political contexts, she highlighted the necessity of modifying or omitting culture-specific elements in translation. Additionally, she proposed techniques for seamlessly integrating visual and verbal components. Furthermore, El-Farahaty (2018) utilized Serafini’s tripartite framework of perception, structure, and ideology, in conjunction with Van Leeuwen’s (2015) visual design elements, to analyze the data, considering multiple layers of meaning and the dynamic interplay between visual and verbal elements.
These challenges highlight the need for a broader methodological toolkit that moves beyond a verbal focus and integrates the analysis of non-verbal modes with verbal information in multimodal translation research. In the present study, Dicerto’s (2018) analysis model is tested to facilitate the AVT of multimodal material from Arabic into English. The study hypothesizes that the model helps translators find the most suitable strategies for handling specific cultural references in an audiovisual context, such as those tied to traditional practices like serving Saudi coffee.
In exploring the intricate intersection of multimodality and translation, scholars have delved into the complexities arising from the interplay of diverse communication modes, including linguistic, visual, audio, gestural, and spatial elements. For example, multimodality in AVT, especially within children’s animated films, is a significant research area (Yanti, 2022). These films, rich in visual, audio, spatial, gestural, and linguistic elements, present translation challenges, as explored by Mujiyanto (2019), who advocates for a nuanced understanding of the entire semiotic impact of audio-verbo-visual translation. Taylor (2016) underscores the vital role of integrating semiotic modalities in film texts, highlighting the potential of multimodal analyses to enhance the practice of AVT.
Chiaro (2010) states that the challenges in translating complex jokes and cultural references in an audiovisual context stem from the polysemiotic nature of audiovisual texts, the constraints of the audiovisual realm, and the need to balance faithfulness to the source text with the retention of humor in the target text. In line with this view, the challenges in subtitling cultural items are multifaceted. From a technical perspective, subtitlers face constraints on the number of characters allowed per line and on the time for which subtitles can be displayed. Additionally, linguistic challenges arise from differences between the source and target languages regarding syntax, vocabulary, style, and semantics. Cultural challenges also present significant obstacles, as subtitlers must navigate differences in cultural norms, references, and idioms between the source and target audiences. For instance, humor and culture-specific references can be particularly challenging to translate effectively. These challenges are further compounded by the constraints of audiovisual content, which limit the subtitlers’ ability to provide detailed explanations.
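The technical constraints mentioned above (characters per line and display time) can be made concrete with a small check. The following is a minimal sketch only: the limits of 42 characters per line, two lines, and 17 characters per second are assumed placeholder values rather than figures taken from the studies cited here, and the `Subtitle` class and `check_subtitle` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List

# Assumed placeholder limits; actual guidelines vary by platform,
# broadcaster, and language pair.
MAX_CHARS_PER_LINE = 42
MAX_LINES = 2
MAX_CHARS_PER_SECOND = 17.0


@dataclass
class Subtitle:
    start: float       # display start time in seconds
    end: float         # display end time in seconds
    lines: List[str]   # text lines shown on screen


def check_subtitle(sub: Subtitle) -> List[str]:
    """Return descriptions of any spatial or temporal constraint violations."""
    problems = []
    if len(sub.lines) > MAX_LINES:
        problems.append(f"too many lines: {len(sub.lines)} > {MAX_LINES}")
    for i, line in enumerate(sub.lines, start=1):
        if len(line) > MAX_CHARS_PER_LINE:
            problems.append(
                f"line {i} too long: {len(line)} > {MAX_CHARS_PER_LINE}"
            )
    duration = sub.end - sub.start
    if duration <= 0:
        problems.append("non-positive duration")
    else:
        # Reading speed in characters per second across all lines.
        cps = sum(len(line) for line in sub.lines) / duration
        if cps > MAX_CHARS_PER_SECOND:
            problems.append(f"reading speed too high: {cps:.1f} cps")
    return problems
```

A check of this kind shows why explanatory glosses for cultural items are rarely feasible in subtitles: any added explanation quickly exceeds either the line length or the reading-speed limit.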
As a solution to dealing with multimodal translation challenges, Silvester (2022) underscores the importance of collaboration among audiovisual material producers, filmmakers, directors, and subtitlers in subtitling French independent films into English. This emphasizes the need for further research into the subtitling process and quality control specific to this context.
Cultural aspects in multimodal texts and audiovisual translation
A significant obstacle encountered in translating multimodal texts pertains to cultural nuances. These intricacies, commonly embedded within humor, politeness, greetings, and other elements of multimodal discourse, present formidable challenges for translators engaged in subtitling and dubbing. Translating such cultural elements requires careful consideration and adeptness in navigating the interplay between linguistic and cultural contexts, which may significantly impact the effectiveness and fidelity of the translated output. For example, Altahri (2013) notes that subtitling for children in AVT involves navigating a complex interplay of cultural, ideological, and technical factors. Culturally, subtitlers must adapt content to resonate with the target audience’s cultural norms and values, particularly when translating from dominant to distinct cultures, as in English to Arabic. Ideologically, they face the challenge of maintaining the original message while aligning with age-appropriate and culturally sensitive themes. Technically, subtitlers encounter limitations in the software, time constraints, and space on the screen, necessitating creative solutions to convey messages effectively. Skilled and culturally sensitive subtitlers ensure successful communication and resonance with the target audience.
In dubbing and subtitling, transcreation is an adaptation process that involves changing words and senses to fit the target language and audience while keeping the desired effect. Transcreation contributes to overcoming cultural challenges in subtitling cultural references by providing a creative and effective tool to bridge the linguistic and cultural gap while maintaining the original message and effect. It is closely related to earlier approaches in translation studies, such as dynamic equivalence and the functionalist approach. However, it differs from traditional translation approaches in that it focuses on adapting cultural references to the target audience, ensuring that the meaning and effect of the original cultural references are effectively conveyed. This approach emphasizes creativity, cultural adaptation, and accuracy in delivering the intended meaning, making it particularly suitable for addressing cultural problems in subtitling (Alawad and Alosaimi, 2023).
Transcreation allows subtitlers to adapt cultural references to ensure the rendering of the original text’s message, image, effect, and style, thus making it easily understood by the audience. By creatively reinterpreting the content while preserving the meaning, transcreation helps to fill the linguistic and cultural gap, making it an effective strategy for addressing the challenges posed by cultural references in AVT. Furthermore, transcreation can compensate for losses, overcome technical requirements, and provide a good experience for the audience, making it a valuable tool for subtitlers. Alawad and Alosaimi (2023) tested the application of the transcreation procedure by collecting examples of cultural references from the animated Arabic movie “Masmeer” and analyzing their translations based on the three fundamental elements of transcreation: accuracy in delivering the meaning, creativity, and cultural adaptation. Their analysis also focuses on the translation’s communicative effect on the audience, considering different channels such as verbal auditory, non-verbal auditory, verbal-visual, and non-verbal visual channels. Each chosen excerpt is discussed individually to examine whether the application of transcreation delivered the meaning and effect accurately and creatively, like the original product.
Pavesi’s (2018) examination of dubbing in AVT underscores its linguistic nuances, emphasizing the routine nature of translational solutions. The study highlights fixed patterns in the language of dubbing, evidenced by established translational routines like “sì signore” for “yes sir” in Italian dubbing. Additionally, she explores the creative aspects of dubbing in response to cross-linguistic and cross-cultural challenges, particularly in translating regional and social dialects. This creativity, evolving from individual acts to social processes, is exemplified by the innovative use of distinctive street culture language. Pavesi’s work contributes to a comprehensive understanding of routine and creative dimensions within AVT.
Pavesi’s (2018) model proposes three complementary dimensions: naturalness and register specificity, target language orientation and source language interference, and routinization and creativity. These dimensions address the balance between naturalness and the appropriate register, ensuring target language orientation while minimizing source language interference and navigating the tension between established routines and creative adaptation in dubbing. This model serves as a framework for analyzing the linguistic aspects of AVT, emphasizing the complexity of the audiovisual text. The study aims to illuminate the translation of casual conversation, explore cross-linguistic correspondences, and advocate for further sociolinguistic, diachronic, and cross-cultural investigations into the language of dubbing.
Translating comedic content in dubbing demands meticulous consideration of cultural nuances and connections. Camilli (2019) proposed a model for dubbing complex jokes that involves identifying the linguistic features at play in the wordplay, considering the historical and cultural closeness of the source and target languages, and utilizing various translation techniques to retain the humorous effect of wordplay in the target language. The model also emphasizes the need for further research in AVT, particularly with larger samples and different language pairs, to expand on the findings and re-examine existing translation models when applied to audiovisual media.
Mubenga (2009) proposes a Multimodal Pragmatic Analysis (MPA) methodology that contributes to the understanding of film discourse in AVT by providing a framework that integrates linguistic and semiotic analysis. This approach comprehensively examines how verbal, visual, and acoustic elements are combined in film discourse, particularly in interlingual subtitling. MPA retains the text-focused flavor of linguistic discourse analysis while incorporating semiotic concepts, thus enabling the analysis of both linguistic and non-linguistic signs within film discourse. By doing so, MPA offers a nuanced understanding of the complexities involved in translating the multimodal nature of film texts, where meaning is conveyed through a combination of different semiotic modes.
The MPA methodology is a comprehensive model designed to study film discourse within the context of AVT, particularly subtitling. The model is structured around three core components, each with corresponding levels of analysis: (1) the functional component, which is derived from Halliday and Matthiessen’s (2004) Functional Grammar and focuses on the verbal element in the film, analyzing patterns of transitivity, mood, thematic structure, and cohesion; (2) the semiotic component, which is based on Van Leeuwen’s (2005) Grammar of Visual Design and deals with the visual elements in the film frame, examining patterns of representation, interaction, and composition; and (3) the cognitive component, which draws from Cognitive Linguistics (Fawcett, 1980), Cognitive Psychology (De Vega, 1984), and Frame Semantics (López, 2002) and involves the activation of knowledge structures in long-term memory, making inferences at the clause level, integrating verbal, visual, and acoustic elements, and connecting this integrated information with world knowledge in the context of culture. MPA’s contribution is significant in that it provides a structured and comprehensive way to analyze and understand the intricacies of film discourse in the realm of AVT, which is essential for producing compelling and nuanced translations.
In multimodal texts, visual-verbal relationships can be complex and varied, contributing to the overall meaning in different ways. According to the taxonomy developed by Martinec and Salway (2005), visual-verbal relationships can be understood in terms of status and logico-semantic connections. Status refers to the hierarchical relationship between visual and verbal elements, which can be categorized into three types: (1) complementary, where images and verbal content work together to create meaning, neither dominating the other; (2) hierarchical, where one mode (either visual or verbal) is dominant over the other, which plays a subservient role; and (3) independent, where images and verbal content function separately, each conveying different parts of the message.
Logico-semantic connections, on the other hand, refer to the logical relationships that can exist between visual and verbal elements. Examples include extension, where one mode extends the meaning provided by the other; enhancement, where one mode enhances the meaning of the other by adding additional information or emphasis; and elaboration, where one mode elaborates on the meaning provided by the other, offering more detail or explanation.
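The two axes of this taxonomy, as summarized above, lend themselves to a simple annotation record. The sketch below is illustrative only: the class names, field names, and validation rule are assumptions introduced here and are not part of Martinec and Salway’s (2005) own formalization.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    """Status relation between image and text, as summarized above."""
    COMPLEMENTARY = "complementary"
    HIERARCHICAL = "hierarchical"
    INDEPENDENT = "independent"


class LogicoSemantic(Enum):
    """Logico-semantic connection between image and text."""
    EXTENSION = "extension"
    ENHANCEMENT = "enhancement"
    ELABORATION = "elaboration"


@dataclass(frozen=True)
class VisualVerbalRelation:
    image_id: str   # hypothetical identifier for a frame or image
    text_id: str    # hypothetical identifier for an utterance or caption
    status: Status
    logic: LogicoSemantic
    dominant_mode: Optional[str] = None  # "visual" or "verbal"

    def __post_init__(self):
        # A dominant mode only makes sense for a hierarchical relation.
        if self.status is Status.HIERARCHICAL:
            if self.dominant_mode not in ("visual", "verbal"):
                raise ValueError("hierarchical relation needs a dominant mode")
        elif self.dominant_mode is not None:
            raise ValueError("dominant mode applies only to hierarchical status")
```

Recording both axes for each image-text pair keeps the status judgement separate from the logico-semantic one, so that an analyst can revise either without touching the other.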
The review of multimodal analysis and AVT literature reveals several limitations in current research. A key issue is the difficulty of integrating various semiotic modes (linguistic, visual, audio, and gestural), which often results in separate monomodal analyses instead of a cohesive multimodal approach. Systemic functional analysis, for example, offers detailed breakdowns of multimodal texts but fails to produce comprehensive insights into multimodal meaning construction, leading to the “infinite detail” problem. Similarly, corpus-based methods encounter technological hurdles, particularly in aligning and annotating multimodal content in source and target texts. Translators also face difficulties adapting cultural elements like humor, politeness, and references, which require creativity to preserve the original intent.
Despite these advances, there remains a gap in applying multimodal pragmatic models to enhance cultural promotion through audiovisual content. This study addresses this gap by utilizing Dicerto’s (2018) multimodal pragmatic model to analyze Saudi audiovisual productions, particularly those that promote tourism and cultural traditions, such as Saudi coffee. Through the lens of Relevance Theory, focusing on optimal relevance and the distinction between explicit and implicit meaning, this research aims to provide a framework for maintaining semantic fidelity in translation and enhancing the effectiveness of these materials for international audiences. This approach can help translators navigate cultural references and ensure that the translated content conveys the depth and nuances of Saudi culture.
Theoretical framework and analysis model
The present study utilizes Dicerto’s (2018) proposed framework; the model builds upon the work of Barthes (1977) and Halliday and Matthiessen (2004), who identified different ways in which images and language can interact. The proposed model is a comprehensive framework for systematically analyzing multimodal texts and their translation. It is designed to address the complex organization of multimodal texts, their communicative possibilities, and their potential challenges in translation.
In Dicerto’s (2018) proposed multimodal pragmatic model, various dimensions of meaning are integrated to facilitate the understanding and translation of multimodal texts. It is grounded in Relevance Theory and is designed to help translators and researchers systematically analyze the complex interplay of multimodal texts’ verbal and visual (and potentially aural) elements. The model is structured around three main dimensions:
Multimodal pragmatic meaning
This dimension encompasses the overall context of the text and the sender’s communicative intention. It is informed by the semantic representation of the text and the recipient’s contextual and world knowledge, which help identify potential translation issues and strategies for resolving them.
Meaning in individual modes
The model assesses the contribution of each mode’s semantic representation to the text’s overall meaning. It acknowledges that images and language have equal status in a multimodal text and that each mode carries its semantic potential.
Interaction of the modes
This dimension examines the relationships between different modes within the text, such as visual and verbal elements. It considers how these modes interact to create meaning, including relations of equivalence, contradiction, complementarity, independence, and relations of expansion and projection.
Relevance-theoretic orientation
The relevance-theoretic orientation ensures the translation maintains the closest possible interpretive resemblance to the source text. This means that the translation should trigger cognitive effects in the target audience similar to those the original triggers in the source audience, with an appropriate balance of effort and reward.
The model operates under the presumption of optimal relevance, which suggests that the sender of a message aims for it to be optimally relevant to the audience. The recipient uses this presumption to guide their interpretation, actively searching for the most relevant semantic representation and considering the processing effort required.
To facilitate the analysis, the model employs the concepts of “cluster” and “phase” to organize the multimodal text’s development over space and time. It allows the user to allocate textual resources and their related meanings at the correct point of textual development. The model is mainly an analytical tool for research, self-development, and teaching in translation studies. It is not designed for everyday translation work. However, it can be a valuable resource for raising awareness about the organization of multimodal texts and the challenges they may present in translation. The model provides a scalar and multidimensional approach to multimodal text analysis, emphasizing the importance of the interaction between different modes and the inferential meanings that arise from this interaction, all within the framework of relevance theory.
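One way to picture how clusters and phases organize an analysis is as a nested record. The sketch below assumes a reading in which a phase is a time span of the text and a cluster is a grouping of co-occurring resources within it; this reading, along with every class, field, and function name, is an assumption made for illustration and is not Dicerto’s (2018) own notation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Resource:
    mode: str      # e.g. "verbal", "visual", "aural"
    content: str   # analyst's note on the semantic representation


@dataclass
class Cluster:
    label: str
    resources: List[Resource] = field(default_factory=list)


@dataclass
class Phase:
    start: float   # seconds from the beginning of the video
    end: float     # exclusive end time in seconds
    clusters: List[Cluster] = field(default_factory=list)


def phase_at(phases: List[Phase], t: float) -> Optional[Phase]:
    """Return the first phase whose time span contains t, if any."""
    for phase in phases:
        if phase.start <= t < phase.end:
            return phase
    return None


def modes_in(phase: Phase) -> Set[str]:
    """Collect the distinct modes used across all clusters of a phase."""
    return {r.mode for c in phase.clusters for r in c.resources}
```

With such a structure, each textual resource and its meaning is tied to a specific point in the text’s development, which is the allocation function the model attributes to clusters and phases.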
Relevance theory and the translation of multimodal texts
In the theoretical framework, Dicerto’s (2018) model identifies embedded meanings in source multimodal texts. Consequently, the relevance theory (RT) framework (Gutt, 2000) directs the selection of translation strategies and procedures to ensure accurate translation, emphasizing the principle of optimal relevance. The presumption of optimal relevance is a critical principle in Relevance Theory, which posits that communicators aim to be as relevant as possible within certain cognitive constraints, and communication recipients expect this effort in return. When applied to the interpretation of multimodal texts, this presumption guides the reconstruction of semantic representation in several ways:
- Active interpretation: Recipients actively engage with the text, looking for meaningful connections between visual and verbal elements most relevant to their understanding.
- Effort and reward balance: Recipients expect their effort in interpreting the text will be rewarded with sufficient cognitive effects, such as new insights, confirmation of hypotheses, or resolution of ambiguities.
- Cross-media interaction: The presumption of optimal relevance leads recipients to explore potential interactions between semiotic modes (e.g., visual and verbal) that contribute to the text’s overall meaning.
- Inference and context: Recipients use the context and their cognitive abilities to infer the intended meaning behind the multimodal elements, going beyond the explicit content to understand the implications and explanations suggested by the text.
- Dimension of analysis: The presumption of optimal relevance informs the analysis of multimodal texts by highlighting the importance of considering the general dimension of analysis, visual-verbal relations, and the meaning of individual modes.
- Semantic representation: Understanding the semantic representation of multimodal texts involves recognizing the status and logico-semantic connections between visual and verbal content, which are reconstructed based on the expectation of relevance.
- Translation: For translators, the presumption of optimal relevance is crucial in ensuring that the translated text maintains the inferential meanings and the semantic representation of the original, allowing the target audience to achieve similar cognitive effects with comparable effort.
In summary, the presumption of optimal relevance is a guiding principle for analyzing and interpreting multimodal texts, ensuring that the semantic representation is reconstructed to balance cognitive effort with communicative reward. This principle is fundamental in translation, where the goal is to preserve the sender’s intended meaning and the inferential richness of the source text.
In practical terms, this means that the recipient uses the presumption of optimal relevance to reconstruct the semantic representation of the multimodal text, looking for logico-semantic relationships between visual and verbal content that shape the text’s overall meaning. The presumption of optimal relevance acts as a ‘communicative glue’ that binds together the different modes in a multimodal text, guiding the recipient to consider them as conveying a single message and to process them as interrelated components of a single textual unit.
The model suggests that the recipient aims to recognize the sender’s intention, inferred from the text and its context. This process involves categorizing multimodal meaning as explicit or implicit, a distinction that helps the recipient recognize the sender’s intentions. The model also posits that while grammar aids in navigating the semantic representation of verbal messages, multimodal texts lack such a grammar. Thus, the recipient must infer the logical form guided by the presumption of optimal relevance. This is demonstrated in Table 1 below.
In this study, the researcher adapts Dicerto’s (2018) model, extracting elements pertinent to the study’s context, as illustrated in Fig. 1:

Fig. 1. Multimodal pragmatic analysis model.
This research assesses the viability of employing Dicerto’s multimodal pragmatic model in analyzing multimodal texts for translating culture-promoting materials. Three samples from a Ministry of Culture-produced video on “The Year of the Saudi Coffee” are analyzed qualitatively. This video showcases Arabic coffee as emblematic of generosity and entrenched social customs in Saudi society, meticulously depicting each step of the coffee-serving process and its nuanced societal implications. Serving as a microcosm of broader cultural promotion endeavors, the video seeks to showcase traditions and engender global interest and comprehension. The qualitative analysis proceeds by delineating each level of the multimodal material, as illustrated in the model above (Fig. 1). Subsequently, this analysis informs the recommendation of optimal translation strategies to convey the original message’s intent in the translated (subtitled) version, specifically from Arabic into English, in accordance with the tenets of Relevance Theory.
Methods
Following the model elucidated in this investigation, the researcher chose a sample comprising three excerpts from a video promoting Saudi Coffee and its cultural traditions. The video analyzed in this study was produced by the Ministry of Culture in 2022 as part of a series of productions themed around “The Year of the Saudi Coffee.” This video was disseminated through multiple platforms, including television channels and various social media outlets such as X, Instagram, and others, to maximize its reach and impact. This strategic distribution was aimed at promoting Saudi cultural traditions and values to both domestic and international audiences.
The key variables
Multimodal elements
Linguistic (spoken language, subtitles), visual (images, gestures), audio (music, sound effects), gestural (body language, facial expressions), and spatial elements (layout, spatial organization) are analyzed to understand their contribution to the overall meaning of the video.
Cultural load
The cultural references and values depicted in the video, such as generosity, strong family ties, modesty, and the socialization of male children, are critical variables. These elements are analyzed to see how they are conveyed through different modes and how they can be effectively translated.
Promotional aspect
The video’s purpose as a promotional tool for tourism in Saudi Arabia is a significant variable. The analysis focuses on how the video introduces Saudi culture and values to attract international tourists.
Sample collection
The sample for this study is a 3-minute video produced by the Ministry of Culture in 2022, themed around “The Year of the Saudi Coffee.” The selection of data for this research on AVT between Arabic and English is justified by several key factors. The video chosen is laden with cultural references that highlight traditional Saudi values such as generosity, strong family ties, modesty, and the socialization of male children. Analyzing these cultural elements is crucial for understanding how to translate cultural nuances effectively. Additionally, the video is part of Saudi Arabia’s Vision 2030 initiatives to boost tourism. It is a prime example of how audiovisual materials promote cultural heritage and attract tourists. As a representative sample of the type of audiovisual content used in cultural promotion efforts, this video is an ideal case study for testing Dicerto’s multimodal pragmatic model within the context of AVT. Moreover, the video was produced in Arabic and, to the researcher’s knowledge, has not been dubbed or subtitled to date. This aspect is significant as it ensures that the original content and context of the video remain intact, providing a genuine basis for the analysis conducted in this study.
Procedures
This research follows a qualitative descriptive analysis approach that involves a detailed examination of the multimodal elements within the video, including linguistic, visual, audio, gestural, and spatial components. The analysis begins with an overarching framework that establishes the general dimensions of analysis before delving into more specific components of the source text (ST) and their interactions, as illustrated in Fig. 1. Then, Dicerto’s Multimodal Pragmatic Model is used to analyze the sender’s intention and the distinction between explicit and implicit meaning within the context of Relevance Theory. This model helps us understand how different modes of communication work together to create meaning and how this meaning can be translated effectively, as explained in Table 1.
Data analysis and results
Sample 1
The initial segment (see Fig. 2) depicts a young man clutching a mobile phone in a domestic setting with his family while his father announces that the forthcoming family gathering will take place at their residence. Subsequently, the young man exhibits anxiety, prompted by the responsibility of serving Saudi coffee to the guests (see Fig. 3). In response, the father engages the services of an expert to instruct his son in this task. This portrayal conveys to the audience that serving the coffee extends beyond mere generosity, encompassing a manifestation of social respect.

Opening scene (Sample 1 image A).

Family gathering (Sample 1 image B).
The analysis
Table 2 above examines key analytical components in this sample, delineating semantic representations across the three modes within the multimodal text. Simultaneously, it elucidates the inferential meanings embedded in the multimodal text, revealing a distinct message from the sender: the significant role of serving coffee in the men’s majlis as a cultural practice in Saudi Arabia. This role is typically assumed by the host’s son, incorporating specific rituals to align with the cultural purpose and values associated with this practice in Saudi society.
The sender’s message in the multimodal text is communicated through three modes: verbal, aural, and visual, with each mode complementing the others to convey inferential meanings. For translation purposes, focusing solely on the verbal mode is advisable, as the aural and visual modes can be left unchanged. This approach is justified by the video clip’s intention to introduce Saudi culture and coffee traditions to a foreign audience, emphasizing traditional greetings with phrases and idioms as the host receives guests in the initial scene of the clip.
Translating the verbal mode necessitates thoughtful consideration to ensure optimal relevance in conveying the scene’s message, particularly because the speakers use the Najdi Saudi dialect rather than Standard Arabic. Employing a foreignization strategy that incorporates culturally significant terms like “gahwa” and “azima” enhances the video’s cultural resonance and fosters audience awareness of Saudi culture. The visual mode supplements comprehension by visually representing the meanings associated with these vocabulary items.
Sample 2
The video segment (see Fig. 4) captures the initiation of the training journey, featuring an expert, a Bedouin man, alongside the host’s son. The expert, dressed in traditional attire, instructs the young man, who wears a thobe without a head cover. The focus is on the expert’s explanation of the rules and rituals of coffee service, emphasizing the initial steps: body posture and the positioning of the pot and cup.

Lesson in serving Saudi coffee (Sample 2 image A).
Table 3 above systematically examines key analytical components within the sample, delineating semantic representations across the three modes in the multimodal text. It concurrently elucidates inferential meanings, conveying a distinct message regarding the crucial role of correctly serving coffee in the men’s majlis. The initial step involves the precise positioning of the body and kit, with specific emphasis on serving the cup with the right hand, a gesture symbolizing a cheerful greeting and aligning with the common Islamic practice of using the right hand for eating and drinking.
The sender’s message is conveyed through verbal, aural, and visual modes, each complementing the others to convey inferential meanings. In translation, prioritizing the verbal mode is recommended, justified by the video’s aim to introduce Saudi culture and coffee traditions to a foreign audience. Translating the verbal mode requires careful consideration, mainly because the video uses the Najdi Saudi dialect rather than Standard Arabic. Adopting a literal translation strategy in this context clarifies the importance of using the right hand for serving, enhances the video’s cultural resonance, and fosters audience awareness of Saudi culture.
Sample 3
In the multimodal analysis of cultural translation, the video segment (see Fig. 5) exemplifies a significant aspect of Saudi coffee service, where an expert imparts knowledge to a younger individual. The visual representation underscores the cultural intricacies involved, emphasizing the meticulous practice of filling the finjal, a small cup, up to a precise one-third level, thereby contributing to a nuanced understanding of the cultural dimension.

How much coffee to serve? (Sample 3 image A).
Within the majlis, a visual aid in the form of a blackboard prominently displays an illustration delineating the precise one-third filling level of a finjal. Concurrently, the expert imparts knowledge to the young individual, who diligently takes notes, accentuating the intersection of visual and verbal modalities in conveying the meticulous nuances of serving Saudi coffee.
The analysis in Table 4 above systematically explores critical analytical components in the sample, delineating semantic representations across the three modes in the multimodal text. It also clarifies inferential meanings, conveying a specific message about the pivotal role of serving coffee in precise portions in the men’s majlis, symbolizing a positive greeting and generosity.
The sender’s communication employs verbal, aural, and visual modes, each complementing the others to convey inferred meanings. In the translation process, given the video’s goal of introducing Saudi culture and coffee traditions to a foreign audience, prioritizing the verbal mode is advised. Translating the verbal mode demands careful consideration, mainly because the video uses the Najdi Saudi dialect rather than Standard Arabic. In this context, a literal translation strategy emphasizes the significance of serving the right amount of coffee, enhances the video’s cultural resonance, and promotes audience awareness of Saudi culture.
The model for analyzing multimodal texts contributes to understanding semantic representation in translation by providing a systematic framework for examining the complex interplay of various communicative modes and their combined effect on meaning. According to the model, multimodal texts convey the sender’s meaning similarly to verbal monomodal texts, suggesting that a semantic representation can trigger the retrieval of explicatures and implicatures from both visual and verbal content. This implies that the translation of a multimodal source text (ST) needs to interpretively resemble the original, maintaining a similar semantic representation that prompts the retrieval of similar explicatures and implicatures.
Furthermore, the model aids in identifying potential translation challenges and solutions, which can significantly impact the communicative organization of multimodal target texts (TTs). Translators can better understand and preserve the sender’s intention in the translated text by systematically analyzing the semantic representation of multimodal texts, including the meaningful interaction among modes and the contribution of individual modes to the overall meaning.
In summary, the model provides a comprehensive approach to understanding and preserving the semantic representation of multimodal texts in translation. It ensures that the sender’s meaning, as conveyed through a combination of communicative modes, is accurately interpreted and rendered in the target language.
Discussion
The analysis conducted in this research underscores the efficacy of Dicerto’s (2018) multimodal pragmatic model in aiding translators to achieve optimal relevance when handling culturally rich multimodal materials. The model facilitates a focused transfer of elements into the target language by comprehensively understanding each semantic representation across various modes in the original video while preserving message clarity.
The model contributes significantly to understanding semantic representation in translation by offering a systematic framework for examining the intricate interplay of communicative modes and their collective impact on meaning. This aligns with the principles of Relevance Theory, which emphasize the importance of optimal relevance in ensuring that the translated text is maximally pertinent to the target audience, justifying their cognitive effort in processing it and facilitating access to the sender’s intentions.
Like Martinec and Salway’s (2005) taxonomy of visual-verbal relationships, the model suggests that multimodal texts convey meaning akin to verbal monomodal texts, necessitating the preservation of a similar semantic representation to ensure the retrieval of comparable explicatures and implicatures. This approach is crucial in maintaining the coherence and effectiveness of the translated message, especially in culturally loaded contexts such as promoting Saudi coffee traditions.
The model assists in identifying potential translation challenges and solutions, thereby influencing the communicative organization of multimodal target texts. For instance, the challenges posed by cultural nuances, such as humor, politeness, and culture-specific references, are well documented in AVT research. Studies by Chiaro (2010) and Altahri (2013) highlight the complexities of translating these elements, which require careful adaptation and creativity to maintain the original message and effect. Dicerto’s model, by integrating various dimensions of meaning, provides a structured approach to navigating these challenges, ensuring that translators can better grasp and maintain the sender’s intention in the translated text.
The use of transcreation as an adaptation process, as discussed by Alawad and Alosaimi (2023), is further validated by this study. Transcreation involves creatively reinterpreting the content while preserving the meaning, which is particularly useful in addressing cultural problems in subtitling. The model’s emphasis on understanding the sender’s intention and the distinction between explicit and implicit meaning within the context of Relevance Theory supports the application of transcreation. This approach ensures that the original message, image, effect, and style are effectively conveyed to the target audience. It is a valuable tool for subtitlers dealing with culturally rich multimodal materials.
Limitations of the study
Despite the significant contributions this study makes to AVT by employing Dicerto’s (2018) multimodal pragmatic model, several limitations must be acknowledged. While detailed and insightful, the qualitative descriptive analysis approach lacks the generalizability and quantifiable data that a quantitative or mixed-methods approach could provide, limiting the study’s ability to draw broad conclusions applicable to all types of multimodal texts. Furthermore, the analysis is confined to a single 3-min video produced by the Ministry of Culture in 2022, focused on “The Year of the Saudi Coffee,” which, although representative of Saudi cultural promotion efforts, may not fully capture the diversity of audiovisual materials used in such contexts. A more extensive and diverse sample would be necessary to provide a more comprehensive understanding of the model’s applicability. In addition, one of this study’s limitations is that the video data used was produced in Arabic and, to the researcher’s knowledge, has not been translated via dubbing or subtitling. This could restrict the accessibility and broader applicability of the findings, as they are based on content that remains in its original language.
From a technological and analytical perspective, the study highlights the challenges in integrating various semiotic modes such as linguistic, visual, audio, gestural, and spatial elements. The use of systemic functional analysis, while systematic, can be hampered by the “infinite detail” issue, where detailed descriptions of multimodal text components fail to generate overarching conclusions about multimodal meaning construction. This makes it challenging to integrate metafunctional analyses of distinct modes seamlessly.
The study is also contextually and culturally specific, focusing on Saudi Arabian culture and promoting Saudi coffee traditions. This specificity provides deep insights but limits the study’s direct applicability to other cultural contexts or types of audiovisual materials. Additional research is required to test the model’s applicability across different cultures and contexts. Moreover, while the study acknowledges the challenges in translating cultural references, humor, and politeness, it may not exhaustively explore all potential translation challenges inherent in the polysemiotic nature of audiovisual texts, such as the constraints imposed by character limits and time constraints for subtitles in the audiovisual realm.
Conclusion
The concept of multimodality underscores the intricate nature of human communication, highlighting the utilization of diverse semiotic resources in navigating the complexities of interaction. This phenomenon is particularly pronounced in our digital era, where multimodal texts have become increasingly prevalent. Saudi Arabia’s Vision 2030 initiative, which prioritizes promoting Saudi culture and values globally, offers a compelling example of this trend. In response, a concerted effort has been made to produce diverse audiovisual materials, serving as dynamic tools for showcasing the richness of Saudi traditions on a global stage. This research contributes a descriptive analysis, employing Dicerto’s (2018) multimodal pragmatic model, to advance the promotion of Arabic culture through audiovisual content. By adopting a multimodal approach and focusing on translation strategies within the Saudi Arabian context, this study aims to optimize the impact of such materials on international audiences. The data collected herein offered a representative sample specific to this endeavor, providing valuable insights into the intricate dynamics of cultural representation and translation within the digital landscape.
This study demonstrated that Dicerto’s (2018) model primarily serves as an analytical instrument for scholarly investigation, personal growth, and pedagogical purposes within translation studies. Although it is not tailored for routine translation tasks, it holds significant potential for fostering understanding of how multimodal texts are structured and of the complexities inherent in their translation. Offering a nuanced and multifaceted perspective, the model adopts a scalar and multidimensional approach to analyzing multimodal texts, underscored by the pivotal role of intermodal interactions and their inferential meanings, all within the theoretical framework of Relevance Theory.
In conclusion, this study underscores the significance of employing Dicerto’s (2018) multimodal pragmatic model in analyzing and translating culturally rich audiovisual productions, particularly in promoting Saudi Arabian culture. The research highlights the complexities of multimodal communication and the model’s efficacy in achieving semantic fidelity by integrating various dimensions of meaning grounded in Relevance Theory. However, to further enhance the model’s applicability, future research should explore its effectiveness across diverse multimodal source texts and in professional translation practice across various language pairs and contexts. Testing the model in real-life translation scenarios, such as subtitling children’s animated films or dubbing complex jokes, as discussed by Yanti (2022) and Camilli (2019), can validate its efficiency and efficacy. Additionally, integrating the multimodal pragmatic analysis (MPA) methodology proposed by Mubenga (2009) could provide a more structured framework for analyzing film discourse, offering a nuanced understanding of the complexities involved in translating the multimodal nature of film texts. This combined approach would help ensure that verbal, visual, and acoustic elements are comprehensively examined and translated effectively, advancing AVT research and practice.