CATS: cultural-heritage classification using LLMs and diffusion models

Introduction
Recent rapid advances in artificial intelligence and machine learning have dramatically improved the performance of language and image generative models. In particular, generative models have greatly expanded their utility by learning from vast amounts of digital data and transforming it into actionable forms. These models play a crucial role not only in classifying data but also in understanding data distributions beyond specific labels. In cases where information is extremely scarce and classification through text alone is difficult, generative models can provide visual materials as outputs. Furthermore, in scenarios where textual information is absent entirely, automatic classification can still be achieved solely through images. We attempted to measure the correlation and similarity between cultural heritage items through images using a recent technique, Stable Diffusion. However, we encountered a problem: Stable Diffusion struggles to recognize classical Korean. Even a Stable Diffusion model fine-tuned on Korean has difficulty understanding and processing classical Korean sentences properly. This complicates the analysis of correlations between cultural heritage items through images generated with Stable Diffusion, because the model must generate images similar to the actual heritage items for the downstream model to learn effectively. The process is illustrated in Fig. 1, which depicts the workflow from classical Korean text input to image generation. Interestingly, we found that when we translated classical Korean sentences into English using an LLM and fed them to Stable Diffusion, the model generated images similar to the originals. This suggests that Stable Diffusion has a higher recognition rate for English sentences, and implies that even for classical Korean sentences, the meaning of the original can be visualized through English translation.

Workflow for Image Generation from Classical Korean Text.
The application of generative models can be extended to systems managing extensive digital datasets such as cultural heritage. Efforts are currently underway to digitize accumulated cultural heritage, and the vast and varied information these heritage items contain necessitates systematic classification and management. Depending on the characteristics and significance of each item, management approaches vary, from basic information as shown in Fig. 2a to more detailed descriptions as seen in Fig. 2b. A system that applies various modern techniques is needed to automate classification even for heritage items with differing amounts of information.

a Basic information panel. A photo of an information panel displaying essential details, including the title “Bird Resting on a Plum Tree”, artist Jo Ji-un (1637–1691), period Joseon, 17th century, and medium Ink on paper. The panel also provides a brief description: Jo Ji-un was the son of Jo Sok (1595–1668) and inherited his father’s painting style, excelling in painting flowers, birds, and plum blossoms. This painting, depicting a bird dozing while perched on a branch, demonstrates a composition that, despite its simplicity, conveys a refined sense of balance and artistic sophistication, highlighting the painter’s exceptional brushwork. b Detailed information panel (Korean-English bilingual). A photo of an information panel containing a more detailed description of the cultural heritage artifact. The left side presents the original Korean text, while the right side provides its English translation.
Figure 3 is an archive-based visualization of associations. Currently, the data is organized manually without the help of AI, which must be improved in order to organize a large amount of digital cultural heritage. This process is complex and time-consuming and has limitations in accuracy and efficiency. Therefore, it is essential to build an automated system on top of the existing organized information to achieve an accurate classification system.

Archive-based Visualization.
As a result, cultural heritage data created by making the most of textual information can be used in various ways. An example of 3D modeling is shown in Fig. 4, which converts visual data into 3D. Digitized cultural heritage data can be used to arrange exhibit spaces in museums, analyze data using 3D materials, and create visual information for users. Therefore, this paper provides the following contributions:

Example of use when cultural heritage is processed into 3D images.
1. We present a method for generating and using visual materials with generative models in the digitization process of cultural heritage. We utilize Stable Diffusion, a generative model, to visually represent and reconstruct cultural heritage textual data. This generates augmented data that can be used to improve the performance of learning models for association analysis.
2. We propose a method for automatically classifying and managing large-scale cultural heritage data by applying Multi-label Classification to a cultural heritage management system. It combines class names and image features to distinguish various attributes of cultural heritage, reducing the cost and time of manual classification and improving management efficiency.
3. We use a large language model to supplement the missing textual information of cultural heritage items that lack textual descriptions. This solves the problem of scarce textual data and adds diversity to visualization generation.
Related work
Tasks with Cultural Heritage
Existing cultural heritage data suffers from unstructured records and insufficient information, which is a challenge for the various institutions managing cultural heritage. Recent research utilizing cultural heritage data has addressed such problems with deep learning algorithms. To address this, one study1 proposed a method to accelerate the recognition of detailed historical architectural elements and the development of historic building information modeling (HBIM)2 from data using semantic segmentation3 of 3D point clouds4. DeFi5 acquires 3D data through 3D sensors or multi-view reconstruction for the preservation of digital cultural heritage and proposes a method for learning hole-boundary detection by generating synthetic datasets to form accurate 3D point clouds. ConvSRGAN6 proposes a method for restoring and enhancing traditional Chinese paintings by applying Enhanced High-Frequency Retention Modules, which utilize residual blocks and a high-frequency emphasis loss function to maintain high-frequency components and restore detailed features in the images. MROPM7 uses photorealistic style transfer (PST) to style an old photo using multiple references and then enhances the result to create a modern image. ArchGPT8 proposes a method focused on customized tasks for the repair and preservation of traditional buildings by utilizing large language models (LLMs). Specifically, it applies Retrieval Augmented Generation (RAG) to analyze the damage status of buildings through image recognition, information retrieval, and image rendering, and derives appropriate restoration methods based on the results. While various studies have applied these deep learning techniques to the cultural heritage field, no specific results have been reported using the text-to-image generative model Stable Diffusion9. Therefore, we demonstrate how Stable Diffusion can be used to process cultural heritage data and extract meaningful patterns and relationships from it.
Large Language Models
LLMs play an important role in processing and generating textual data. Models in natural language processing can be categorized into three main types based on the transformer structure: encoder type, decoder type, and encoder-decoder type. The first encoder-type model, Bidirectional Encoder Representations from Transformers (BERT)10, was proposed to predict masked tokens in text and to determine the probability of one text passage following another; it introduced masked language modeling (MLM)11 and next sentence prediction (NSP). The innovation of BERT influenced the development of many other models, one of which is the Robustly optimized BERT approach, RoBERTa12. RoBERTa builds on BERT, showing that modifying the pre-training recipe by training for longer, in larger batches, and with more data leads to better performance. It significantly improved over the original BERT model while dropping the NSP task. DistilBERT13 proposed a faster and less memory-intensive method by applying knowledge distillation, addressing the drawback that BERT is difficult to deploy in low-latency environments due to its size. ALBERT14 made three major modifications to the encoder structure to make it more efficient: it reduced the embedding dimensionality by separating the token embedding dimension from the hidden dimension, reduced the actual number of parameters by having all layers share the same parameters, and replaced the NSP objective with sentence order prediction. ELECTRA15 was proposed to address the constraint that the standard MLM pre-training objective only updates the masked token representations at each step, while the other input tokens are not updated. It uses two models, one to predict the masked tokens and a discriminator to detect replaced tokens, improving training efficiency. DeBERTa16 proposed that by separating token content and relative position, the self-attention layer better models the dependencies between pairs of adjacent tokens. By adding absolute position embeddings just before the softmax layer of the token decoding head, it became the first model to outperform humans on the SuperGLUE17 benchmark. The decoder type is represented by the GPT models. Decoder models excel at predicting the next word in a sentence and can therefore be used for most text-generation tasks. GPT was trained to predict the next word based on the previous words by combining a transformer decoder architecture with transfer learning. Its successor, GPT-218, extended the training set of the original model and excels at generating coherent long sequences of text. CTRL19 improves on GPT-2, which offers little control over the style of the sequences it generates: by adding a control token at the beginning of the sequence, the style of the generated text can be controlled, allowing different kinds of sentences to be produced. GPT-320, which grew out of the success of GPT-2, analyzed the behavior of different language models to discover scaling rules relating compute, dataset size, and model size. This model not only produces highly realistic text passages but also performs well in few-shot settings. The last transformer architecture, the encoder-decoder type, is represented by T521. T5 unifies all NLU22 and NLG23 tasks as text-to-text conversion. It uses the original architecture of the transformer and comes in many versions with different parameter counts.
BART combines the pre-training ideas of BERT and GPT in an encoder-decoder architecture. The input sequence is corrupted and passed through the encoder, and the decoder reconstructs the original text.
LLaMA24, a recent high-performing model, follows the technique used in GPT-3 of normalizing the input of each Transformer sublayer to increase training stability. Performance is optimized by removing absolute position embeddings and instead applying Rotary Positional Embeddings (RoPE)25. Alpaca26 is a fine-tuned version of LLaMA that generates large instruction datasets by applying Self-Instruct27, a method that helps LLMs improve their ability to follow human instructions. It shows that fine-tuning a pre-trained LLM on such data can create a small instruction-following model that matches the performance of larger LLMs at low cost. In this paper, we chose GPT-3.5, which is both search-based and conversational, to create appropriate prompts for cultural heritage metadata. We found that it is easy to refine the desired prompts and that it provides a highly accurate English description of the original prompts.
Diffusion Models
Generative models are techniques that learn the distribution of a dataset and sample from the trained model to create new, realistic images that do not exist in the original dataset. Among generative approaches, VAE28, an autoencoder29 structure that utilizes latent variables, emerged first; it maps each data point to a region of a multivariate normal distribution in latent space. The later GAN30 uses a generator that serves the same purpose as the decoder in a VAE, converting vectors in the latent space into images. The discriminator, built from convolutional layers, predicts whether an image is real or fake. The idea is that the generator converts random noise into samples that appear to come from the original dataset, while the discriminator predicts whether they are from the original dataset or are forgeries of the generator. Rather than training directly on full-resolution images, ProGAN31 trains the generator and discriminator on low-resolution 4×4-pixel images and then incrementally adds layers during training to increase the resolution. StyleGAN32, which is based on ProGAN, injects style vectors into the network at various points so that latent vectors correspond more cleanly to high-level attributes. StyleGAN233 removed the AdaIN34 layer of the generator and replaced it with weight modulation and demodulation steps to improve the quality of the generated output; the paper showed how to eliminate artifact problems while maintaining control over the style of the image. SAGAN35 applied the self-attention mechanism of transformers to solve the problem that convolutional feature maps only process local information. BigGAN36 proposed using the same latent distribution for training but a truncated normal distribution for sampling to increase the reliability of the generated samples. VQ-GAN37 creates a codebook, a list of trained vectors, and represents images with the codebook vectors associated with the corresponding indexes. ViT VQ-GAN38, an extension of VQ-GAN, replaces the convolutional encoder and decoder with transformers. The image is divided into a series of patches, which are tokenized and used as input, and the resulting embeddings are quantized according to a learned codebook. The decoder transformer then processes the integer codes, and the image is formed from the sequence of patches.
Generative models evolved further with the introduction of the diffusion model DDPM39. DDPM trains a deep-learning model to remove noise from an image step by step. In the forward diffusion process, the image is progressively corrupted until it is indistinguishable from standard Gaussian noise; the backward diffusion process removes the noise from random noise to produce the output image. DDIM40, a refinement of DDPM, allows much faster sampling than traditional diffusion-based generation models: it uses implicit sampling to accelerate the image generation process while maintaining the quality of the original diffusion model. A text-to-image generative model generates images from a given text prompt. A representative text-to-image model, LDM9, wraps a diffusion model with an autoencoder so that the diffusion process operates on a latent-space representation of the image rather than the image itself. The autoencoder encodes the image details into the latent space and decodes the latent representation back into a high-resolution image; running the diffusion process directly in pixel space is poor in both speed and performance, whereas LDM significantly improves the speed and performance of training because it operates only on the conceptual latent space.
DALL-E41 proposes to model the text-to-image generation task auto-regressively by utilizing transformers on a single data source. DALL-E242 does not train a text encoder but uses a pre-trained CLIP43 as an encoder and generates the final image from the image embedding produced by the text prompt. Flamingo44 uses a visual encoder that encodes visual input features into a small number of visual tokens using VLM45 techniques and a perceiver resampler to pass the visual information to the transformer. ControlNet46 proposes a way to control the output image using a Canny edge map of the input image; it is configured to fine-tune Stable Diffusion9 using a small number of images. Stable Diffusion is a latent diffusion model that operates in the latent space of the autoencoder rather than on the image itself. Rather than directly predicting pixel values, the diffusion model predicts the compressed latent embedding from the autoencoder, which has the advantage of reconstructing the image from a more abstract representation of the data rather than a pixel-by-pixel prediction. SDXL2 uses a U-Net structure47 that is three times larger than the original Stable Diffusion, significantly enhancing high-resolution image generation performance. It effectively handles various resolutions and aspect ratios by incorporating image size and crop conditions into model training. Additionally, it refines generated images to greatly improve their detail and resolution. Openjourney48 was developed to give users complete control over the generated content; it is an open-source model fine-tuned from Stable Diffusion version 1.5. Imagen49 shows that using T5, whose encoder is pre-trained on text only, has a greater impact on overall performance than scaling the decoder of the diffusion model. It also outputs the generated image at super-resolution using a super-resolution upsampler model. Prompt-to-Prompt50 proposes a method for semantically editing specific regions of an image in a text-to-image generative model simply by manipulating text prompts. By controlling which pixels attend to which tokens of the prompt text during the diffusion steps, a cross-attention map can be injected during the diffusion process to edit the image. BK-SDM51 is a model that removes residual blocks and attention blocks from Stable Diffusion and reduces the model size through knowledge distillation. By compressing the existing Stable Diffusion and simplifying the model structure, it can be used at low cost.
While these models have the ability to effectively process and generate text and image data, they present a number of challenges, especially when dealing with non-English languages such as Korean. We found that models for generating images from Korean text perform poorly compared to English models. This is because even models trained on Korean have less training on historical content, and the text-to-image conversion is less complete when using Korean words verbatim. Korean also has a different grammatical structure and vocabulary than English, which requires additional training for models to handle it effectively. One of the main problems with processing Korean is the mix of classical words and foreign words (e.g., Chinese characters, Japanese, and English). It is very difficult to accurately understand and process such text, which can lead to poor model performance. To improve this, in this study, we tried an approach to convert Korean text to English. This approach helps the model perform better while preserving the meaning of the text. The main goal of this paper is to compare and analyze various LLMs and image generation models, and to present the problems they encounter in processing Korean and classical words and their solutions. By doing so, we hope to provide users with better performance and convenience, and explore the possibility of implementing them at a lower cost. In particular, we propose that the English conversion approach can effectively solve the problems encountered in Korean processing.
Methodology
For cultural heritage with little textual information, classification using text alone is difficult, so a generative model can be used to represent the item visually. Cultural heritage data has traditionally been classified and managed mainly through text; this study proposes a new method of generating images and classifying heritage items through those images.
Figure 5 shows the overall process of this study. This process enables managers and users to visually understand and utilize cultural heritage information that is otherwise difficult to interpret. In addition, even when the amount of information is small, the generated images can be used to find new classification associations.

Main Process: (i) Translation of complex heritage texts from ancient languages into simplified English, (ii) Use of these English descriptions as inputs for a text-to-image generative model, (iii) Application of generated images in a multi-label image classification system to analyze cultural heritage associations.
Problem Setting
Prompt Refining
Each cultural heritage item commonly carries information such as name, material, and historical context, and this information was used to construct prompts. In this study, the results of using Korean and English prompts were compared and analyzed; the English descriptions generated images with more consistency and higher utility as visual materials. Using Korean directly produced very similar outputs even for unrelated meanings, and because the historical records also include foreign languages besides Korean, good results could not be obtained. Therefore, the prompts were unified into English, as shown in step (i) of Fig. 5. Classical Korean terms posed challenges even for the latest LLMs, which could not understand them. GPT-3.5 was therefore used, as it can readily translate the word corpus into English descriptions.
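As a rough illustration of this refinement step, the sketch below turns sparse heritage metadata into an English image prompt using the OpenAI Python SDK. The model name, metadata fields, system prompt, and function name are illustrative assumptions rather than the exact pipeline used in this work.

```python
# Hypothetical sketch: translating classical-Korean metadata into an English
# prompt with GPT-3.5 via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def refine_prompt(name: str, material: str, era: str) -> str:
    """Turn sparse heritage metadata into a concise English image prompt."""
    metadata = f"Name: {name}, Material: {material}, Era: {era}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Translate Korean cultural-heritage metadata, including "
                        "classical terms, into a short English description "
                        "suitable for a text-to-image model."},
            {"role": "user", "content": metadata},
        ],
    )
    return response.choices[0].message.content.strip()

# Example with illustrative values: a Goryeo-period roof stone
print(refine_prompt("옥개석", "돌 - 화강암", "한국 - 고려"))
```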
Text-to-Image Generator
In this study, we analyzed various text-to-image generative models, including Openjourney, DALL-E 2, DALL-E 3, SD 1.5, SD 2.1, and SDXL 1.0. Initially, images were generated with models fine-tuned on Korean so that Korean text could be used directly. However, these models produced not variations on a similar pattern but entirely different content for the same input. One reason is that models trained on Korean lack sufficient exposure to historical content to interpret such descriptions appropriately. The Korean fine-tuned Stable Diffusion model demonstrated fast processing speed and cost efficiency but tended to generate unrelated images when classical words were involved, making it unsuitable for multi-label classification tasks. Therefore, we translated cultural heritage information containing classical words into English and compared the text-to-image generative models on this input. Among them, DALL-E 3 and SDXL 1.0 particularly excelled at generating images that matched the English descriptions. However, despite its excellent performance, DALL-E 3 makes mass image generation difficult due to its closed-source nature. In contrast, SDXL 1.0, being open source, successfully generated visually distinct images with excellent features.
Considering that semantic visualization results matter more than visual quality in a search system, we employed the open-source, high-performance SDXL 1.0 for image generation. The process of generating images from textual descriptions is formulated as follows:

I = f(D; θ),

where I denotes the generated image, D represents the input sentences, and θ denotes the parameters of the model. SDXL 1.0 is open source, offers good performance and processing speed, and produces visually distinguishable images with good features; since retrieval performance is what matters in this study, these semantic visualization results take precedence over the visual quality of the images.
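A minimal sketch of this generation step, assuming the Hugging Face diffusers library and the publicly released SDXL 1.0 base checkpoint, is shown below; the prompt is an illustrative translated description, not one drawn from the dataset.

```python
# Sketch: generating an image I = f(D; theta) from a refined English
# description D with SDXL 1.0 via Hugging Face diffusers.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

description = ("A granite roof stone from a Goryeo-period Buddhist pagoda, "
               "weathered gray stone, museum artifact photograph")
image = pipe(prompt=description).images[0]
image.save("generated_heritage.png")
```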
Multi-Label Classifier
Multi-label classification was used to enable efficient classification and search in the image search system. The dataset follows the MSCOCO52 format, with each image assigned multiple labels, and the labels grouped into super categories. As shown in Fig. 6, a multi-label classification model can be applied to a retrieval system by simultaneously predicting the classes belonging to each super category. Specifically, when only the textual information of a cultural heritage item is visualized as an image, each image is multi-label classified to reveal the subclasses predicted in each category. From these results, the model learns which features are similar between heritage items whose textual information has been visualized, allowing it to find and connect similar items. As a result, the system can analyze the associations between images and provide recommended images, which can be used to visually understand the relevance of cultural heritage information.

Overview of the Image Search System using Multi-Label Classification: This figure illustrates how the transformer-based multi-label classification model integrates within the image retrieval system, demonstrating the workflow from image categorization to retrieval based on ‘super categories.’
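One way such associations could be computed, sketched below under the assumption that each item is summarized by its concatenated per-super-category label probabilities (32 values in total), is to rank archive items by cosine similarity to a query item. The function names, similarity measure, and top-k cutoff are illustrative; the paper does not prescribe a specific association metric.

```python
# Hedged sketch: connecting related heritage items via predicted label probabilities.
import numpy as np

def label_similarity(p_a: np.ndarray, p_b: np.ndarray) -> float:
    """Cosine similarity between two concatenated label-probability vectors
    (Age + Life style + Material, 32 classes in total)."""
    return float(np.dot(p_a, p_b) /
                 (np.linalg.norm(p_a) * np.linalg.norm(p_b) + 1e-8))

def recommend(query_probs: np.ndarray, archive_probs: list, top_k: int = 5):
    """Rank archive items by label similarity to the query item."""
    scores = [label_similarity(query_probs, p) for p in archive_probs]
    return np.argsort(scores)[::-1][:top_k]

# Toy usage with random probability vectors
archive = [np.random.rand(32) for _ in range(100)]
print(recommend(np.random.rand(32), archive))
```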
Figure 7 shows the structure of multi-label classification with a transformer-based architecture53 for each super category. This structure consists of two main streams: a Spatial Stream and a Semantic Stream. The Spatial Stream uses a Vision Transformer (ViT) to process the cultural heritage image data generated in the previous step. It divides the image into patches, extracts the visual features of each patch, converts them into vectors, and encodes them into a learnable form; the visual information in the image is processed through a transformer model, which is important for the classification process. The Semantic Stream uses BERT, a large language model, to process the textual information. The textual input contains the actual multi-label classes of the cultural heritage to be matched with the generated images. The stream learns by matching images generated from cultural heritage textual information with the actual categories of the heritage item; specifically, it is trained by matching the generated images with the original textual information (category information: material, age, lifestyle). It analyzes the information shared between text and images and captures the context and semantic relationships between labels. The two streams operate independently, collecting visual and textual features respectively, which are then merged through a convolutional layer.

Transformer-based Multi-label Classification Architecture.
During this process, visual and text features are combined to prepare the data for final classification. The transformer encoder-decoder architecture then processes this combined data to produce the final classification result. This architecture predicts labels by considering the visual features of the image patches together with the semantic information in the text data. The connection and classification process is formulated as follows:

ŷ = Softmax(W_h · Concat(f_s, f_t) + b),

where f_s and f_t represent the feature vectors extracted from the spatial and semantic streams, respectively, and W_h and b denote the trainable weights and biases. The Concat function concatenates the feature vectors from the two streams, and the Softmax function converts each element of the final output vector into a probability, with the label of highest probability presented as the final prediction result.
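A minimal PyTorch sketch of this fusion step is given below. The feature dimensions (768 for both streams), the hidden size of the merging convolution (512), and the 32-class output follow reasonable assumptions; the exact layer sizes of the model are not specified here, and the softmax output simply mirrors the formula above.

```python
# Sketch of the fusion head: Concat(f_s, f_t) -> conv merge -> W_h, b -> Softmax.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, d_spatial=768, d_semantic=768, num_classes=32):
        super().__init__()
        # 1x1 convolution merges the concatenated stream features
        self.merge = nn.Conv1d(d_spatial + d_semantic, 512, kernel_size=1)
        self.classifier = nn.Linear(512, num_classes)  # W_h, b

    def forward(self, f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([f_s, f_t], dim=-1).unsqueeze(-1)  # (B, d_s + d_t, 1)
        merged = self.merge(fused).squeeze(-1)                # (B, 512)
        logits = self.classifier(merged)                      # (B, num_classes)
        return torch.softmax(logits, dim=-1)                  # label probabilities

# Toy batch of two items with 768-dim spatial and semantic features
head = FusionHead()
probs = head(torch.randn(2, 768), torch.randn(2, 768))
```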
Dataset
The dataset we introduce is MUCH (Multi-purpose Universal Cultural Heritage), built from 9600 images of Korean cultural heritage and augmented to a total of 96,000 images. Of these, 86,400 images were used for training and 9600 for validation. The dataset was created by processing data provided by the National Museum of Korea and is organized into three super categories: age, life style, and material, with a total of 32 classes. We used name, era, and material information to conduct experiments with minimal metadata. The difficulty of categorizing non-distinct objects stems from abstract super categories such as age and life style. Age has 11 classes, Life style has 7 classes, and Material has 14 classes.
Table 1 lists the classes within each super category. Age includes the classes used to categorize Korean historical periods: Bronze Age, Early Iron Age, Proto-Three Kingdoms, Baekje, Silla, Three Kingdoms, Unified Silla, Goryeo, Late Joseon, Joseon, and the Japanese colonial period. Material refers to the surface or substance of the object: wood, stone, soil, paper, mineral, fossil, seed, lacquer, leaf, leather, bone, fiber, ceramic, and rubber. Finally, Life style represents the industrial and social background of each item: transportation/communication, culture and art, social life, industry/livelihoods, dietary life, clothing life, and daily life.
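For concreteness, the snippet below sketches how a single MUCH entry could be laid out in MSCOCO-style annotations, with one label per super category attached to each generated image. The IDs, file name, and chosen classes are made-up illustrative values.

```python
# Illustrative MSCOCO-style annotation layout for one MUCH image.
annotation_example = {
    "images": [
        {"id": 1, "file_name": "generated_0001.png", "width": 448, "height": 448}
    ],
    "categories": [
        {"id": 10, "name": "Joseon",     "supercategory": "Age"},
        {"id": 21, "name": "Daily life", "supercategory": "Life style"},
        {"id": 30, "name": "Wood",       "supercategory": "Material"},
    ],
    "annotations": [
        {"image_id": 1, "category_id": 10},
        {"image_id": 1, "category_id": 21},
        {"image_id": 1, "category_id": 30},
    ],
}
```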
Results
Implementation Details
Table 2 shows the key parameters used in the multi-label classification model. To evaluate the proposed approach, we conducted experiments using MUCH. We used the average precision (AP) for each category and the mean average precision (mAP) as evaluation metrics, and also measured the overall F1-measure (OF1), which indicates the performance of the model. We utilize a transformer-based multi-label classification model. To balance detail capture and computational efficiency, all images are normalized to a resolution of 448 × 448. The AdamW optimizer54 is employed with a learning rate of 1 × 10−5 and a weight decay of 1 × 10−5. For the loss function, binary cross-entropy (BCE)55 is adopted, as it is well suited to multi-label classification.
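The sketch below mirrors this training configuration (448 × 448 inputs, AdamW with a learning rate and weight decay of 1 × 10−5, and BCE loss). The simple linear model is only a stand-in for the transformer-based classifier, and the batch in the example is random toy data.

```python
# Hedged sketch of the training setup described above.
import torch
import torch.nn as nn
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((448, 448)),   # normalize all images to 448 x 448
    transforms.ToTensor(),
])

# Stand-in for the transformer-based multi-label classifier (32 classes)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 448 * 448, 32))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-5)
criterion = nn.BCEWithLogitsLoss()   # BCE for multi-label targets

def train_step(images: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad()
    logits = model(images)                    # (B, 32) raw scores
    loss = criterion(logits, targets.float())
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: 4 images, 32 binary labels each
loss = train_step(torch.randn(4, 3, 448, 448), torch.randint(0, 2, (4, 32)))
```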
Experiment Results
Quantitative results
Table 3 shows the performance results. MUCH, which consisted of 9600 images before augmentation, yielded strong results. Analyzing the performance of each super category reveals the following characteristics.
Firstly, the mAP in the ‘Age’ category is relatively low at 45.7%. This is likely because classifying Korean historical periods from visual information alone is difficult: clear visual clues for distinguishing between eras may not be readily apparent, lowering classification accuracy. The mAP for the ‘Life style’ category is also low at 44.4%, because the concept of lifestyle itself is hard to define with distinct visual objects; moreover, with only 7 classes, fewer than the other categories, and a limited dataset, training may have been insufficient. Conversely, the ‘Material’ category achieved the highest performance with an mAP of 78.8%, attributable to the ease of visually distinguishing between material types: characteristics such as color and texture provide clear visual clues, enabling high accuracy in image-based classification. Thus, the differences in mAP across the super categories highlight how the visual clarity of the classification targets, the dataset size, and the number of classes significantly influence performance. If criteria are established for abstract terms such as Age, Material, and Life style, and distinguishable object classes are added, MUCH can be expected to perform well. Therefore, an mAP of 56.3% and an OF1 of 50.1% are by no means low scores, and with a larger dataset we expect performance to surpass that of the comparison datasets.
To evaluate the effectiveness and user experience of the proposed system, we collected feedback from users. Users gave their opinions on the system, both positive and negative, which helped us to understand the system’s strengths and what needs to be improved. Table 4 shows the feedback from cultural heritage professionals. Feedback has been anonymized by user number. For each item, usability evaluates the intuitiveness of the user interface and ease of use, while speed evaluates the response time or data processing speed of the system. Accuracy evaluates the accuracy or reliability of the results provided by the AI system, and scalability evaluates whether the system has the potential to evolve in the future and is sustainable.
As shown in Table 4, the majority of users gave the system positive ratings for most of these evaluation criteria, but also identified the need for a high level of accuracy in order for the system to become popular. This feedback will be used as an important reference to set the direction for future system improvements.
Qualitative results
Figure 8 compares the output results from different LLMs. Alpaca, for instance, produces results that diverge significantly from the original meaning of the input prompt. While LLaMA demonstrates good performance, it suffers from slower speeds and occasional inconsistencies in output. On the other hand, GPT-3.5 consistently provides accurate results with rich English descriptions based on search queries. Given the importance of high accuracy for generating correct images in this study, GPT-3.5 was chosen.

Comparison results of different LLMs for translating Korean into English sentences.
Figure 9 illustrates the translation of Korean sentences containing classical words into English using GPT-3.5. The highlighted portions indicate where Korean has been translated into English. The model demonstrates high performance in seamlessly translating Korean into English, even when classical words are included, adeptly grasping the context of sentences and generating appropriate English expressions. These results were obtained by refining the prompts given to GPT-3.5.

The result of translating Korean into English sentences using GPT-3.5.
Figure 10a represents a coin, and the result from the SDXL 1.0 model appears most similar to the actual image, closely matching its texture as well. While other models also generated similar images, SDXL 1.0 stands out for texture fidelity akin to the real object. Figure 10b depicts celadon, with the generated image closely resembling the original overall. Figure 10c features a white porcelain lantern, accurately capturing the characteristics of both the lantern form and the porcelain. Overall, considering the generated image results alongside time and computational costs, SDXL 1.0 proves suitable for data generation and augmentation tasks. This model not only contributes to high performance but also offers practical convenience for users and administrators focusing on real-world applications.

Comparison of visualization results from text-to-image generative models when using translated English sentences as input; (a) coins, (b) celadon bottles, (c) white porcelain lanterns.
Figure 11 compares images generated with the proposed method against the actual images. The amount of metadata available for each cultural heritage item varies greatly. Nevertheless, the images generated after switching to the English descriptions closely resemble the actual images; the generated images in Fig. 11a, b, and c are difficult to distinguish from the originals. Using these images to train the model will also help automatically classify newly added cultural heritage.

a a duck sculpture, b fragments of a jar, c bead bracelets.
Figure 12 compares automated and manual processing in a cultural heritage search and recommendation system. Figure 12a shows a GUI environment in which the trained model analyzes user-uploaded images and automatically classifies them and recommends appropriate categories based on the evaluation results. Figure 12b shows the cultural heritage information page of the eMuseum site operated by the National Museum of Korea, displaying the same items as in Fig. 12a. On this page, all images must be uploaded and organized manually. If an automated approach like the one in Fig. 12a were adopted, the time, effort, and cost of manual work could be significantly reduced. The system can provide results at a practical level because it automatically analyzes the various textual information of each cultural property and classifies it into appropriate categories.

a Graphical user interface for automated classification and recommendation. Uploaded images of cultural heritage data are analyzed and classified based on three main criteria: material, age, and lifestyle. The classification results are visualized using a network graph and bar charts, illustrating the hierarchical structure of subcategories within each criterion. Both visualization methods present the predicted classification probabilities, accurately identifying the given cultural heritage data as Life Style: Daily Life (주생활), Age: Joseon (조선), and Material: Wood (나무). The “Total” score represents an aggregated confidence measure across all three classification categories. b Expert-curated recommendations on the National Museum of Korea’s eMuseum platform. The same cultural heritage data used as input in (a) is shown on the recommendation image page within the age category of the National Museum of Korea’s eMuseum site. This page is curated by experts who manually select similar cultural heritage data. As shown in (a), the AI-predicted recommendations closely align with the manually curated results, demonstrating the system’s effectiveness in identifying relevant cultural heritage data.
Limitation
Figure 13 shows the results of inputting Korean directly into models trained on Korean. These models fail to produce appropriate output because of their limited training on historical context. In text-to-image generative models, attempts to use clusters of Korean words directly resulted in low completion quality for Korean. Even with Korean input, the models randomly generated diverse features rather than consistently producing similar features for the same word. Figure 13a lacks images that closely resemble the original: the name of this cultural heritage item implies a tower made of stone, but except for DALL-E 3, the generated images have characteristics different from the original. For Fig. 13b, the original is a nameplate, but using the Korean explanation results in completely unrelated images. Overall, DALL-E 3 generates images most similar to the originals, but as it is not open source, it cannot generate large quantities of images. Therefore, translating Korean into English sentences produced better image results.

a Generated image of a stone tower. The input text used for generation: 옥개석, 한국 – 고려, 종교신앙 – 불교 – 예배 – 탑, 돌 – 화강암 (translated: Roof stone, Korea – Goryeo, Religious belief – Buddhism – Worship – Pagoda, Stone – Granite). b Generated image of a nameplate. The input text used for generation: 김한동 호패, 한국 – 조선, 사회생활 – 사회제도 – 신표 – 호패, 골각패갑 – 수각, 길이 9.6 cm (translated: Kim Han-dong Nameplate, Korea – Joseon, Social life – Social system – Identification tag – Nameplate, Bone shell plate – Engraved, Length 9.6 cm).
One of the main limitations of the system is the categorization of abstract classes. This problem arises when defining classes for unclear concepts, such as dividing ages or industries for association classification. Especially in the case of age classification, where the historical context of a country is intertwined with many other historical contexts, it is difficult to categorize items with clear criteria. Since the generated images do not contain a wide range of features, it is difficult to distinguish between them; we therefore expect that enlarging the dataset will significantly improve the learning of diverse features from the feature maps. To improve performance further, we plan to generate images with more complex features and apply methods such as negative prompts.
Conclusions
The method proposed in this study is to automatically manage cultural heritage materials with varying amounts of information by analyzing their associations with images generated using textual information. This study suggests that it can make a significant contribution to the classification of cultural heritage, not by competing for high performance but by its potential impact. We have adopted a method that moves beyond traditional approaches of classifying and managing cultural heritage solely based on textual information by integrating various algorithms. This approach will also be useful for quickly identifying the relevance of new information and automatically categorizing it as it is added. This is the first step towards reducing the cost and time of large-scale retrieval systems, and will lead to more sophisticated and efficient management.
In future work, we will include classes not covered in our experiments and improve the classification performance of abstract classes. Metadata not used in the research will also be further refined and continually updated to MUCH. We will apply these improvements to an automated system to validate their effectiveness. The expanded system will contribute to improving the digital archives of cultural heritage and serve as a new starting point for various interdisciplinary research efforts.