Evolutionary optimization of model merging recipes

Main

Model merging1,2, a recent development in the large language model (LLM) community, presents a novel paradigm shift. By strategically combining multiple LLMs into a single architecture, this exciting development has captured the attention of researchers due to its key advantage: it requires no additional training, making it an incredibly cost-effective approach for developing new models. This accessibility has fuelled a surge in interest and experimentation with model merging. The Open LLM Leaderboard3 is now dominated by merged models, showcasing its potential for democratizing foundation model development.

However, model merging is considered by many to be a form of black art or alchemy, relying on the model maker’s intuition and instincts about model selection and merging recipes to create and refine a new model that performs well for a particular task. Furthermore, the model maker is often required to have domain knowledge of the various benchmark tasks. Given the large diversity of open models and benchmarks in the community, human intuition can only go so far, and we believe a more systematic approach for discovering new model combinations will take things much further.

We believe evolutionary algorithms can discover more effective model merging solutions and thus provide a path towards automating the creation of more capable models. As a step in this direction, we present a methodology that leverages evolutionary algorithms to facilitate the merging of foundation models, and we show that evolution can discover novel and unintuitive ways to merge multiple models into new models with new combined abilities. Our approach is distinguished by its ability to navigate both the parameter space (PS), that is, the model weights, and the data flow space (DFS), that is, the inference path, and we propose a framework that integrates these two dimensions.

This work makes several key contributions to the field of foundation model development:

  1. Automated model composition: We introduce the evolutionary model merge, a general evolutionary method to automatically discover effective combinations of selected models for creating new foundation models with user-specified capabilities. This approach harnesses the collective intelligence of existing open models, enabling the creation of powerful models without the need for extensive training data or computing.

  2. Cross-domain merging: We demonstrate that our method can discover novel ways to merge models from disparate domains (for example, non-English language and math, non-English language and vision), potentially exceeding the capabilities achievable through conventional human design strategies.

  3. State-of-the-art performance: We showcase the effectiveness of our method by automatically generating a Japanese LLM with math reasoning capability and a Japanese vision–language model (VLM). Notably, both models achieve state-of-the-art performance on various benchmarks, even without explicit optimization for those tasks.

  4. High efficiency and surprising generalizability: We observe that our 7B parameter LLM surpasses the performance of some previous 70B parameter Japanese LLMs on benchmark datasets, highlighting the high efficiency and surprising generalization capability of our approach. We believe this model can serve as a strong general-purpose Japanese LLM.

  5. Culturally aware VLM: The generated Japanese VLM achieves top results when tested on a domestically sourced dataset of Japanese image-description pairs, demonstrating its ability to handle Japanese culture-specific content.

We are committed to open science and are excited to open-source our EvoLLM-JP and EvoVLM-JP, two state-of-the-art Japanese foundation models, to the community, enabling further research and development in the field. Our work challenges the conventional paradigm of expensive model development by demonstrating that our evolutionary-based method can produce competitive models without relying on gradient-based training. This paves the way for exploring alternative, potentially more efficient, approaches to foundation model development.

Background and related work

Model merging offers a novel approach to leverage the strengths of multiple pretrained models. While transfer learning offers advantages, such as improved performance and faster convergence, the resulting models are typically limited to single tasks. Model merging, on the other hand, strives to create a versatile and comprehensive model by combining the knowledge from multiple pretrained models without any additional gradient-based training, making it very cost effective in terms of compute requirements.

A simple method of merging multiple models is to average the weights of multiple models fine-tuned from the same base model. This model soup4 approach demonstrated substantial improvements on relatively large models. Linear weight averaging is performed as follows: let $\theta_1, \theta_2 \in \mathbb{R}^d$ represent the weight vectors of two distinct models, where $d$ is the dimension of the weight space. The merged model’s weights, denoted as $\theta_{\mathrm{new}}$, are computed as $\theta_{\mathrm{new}} = \lambda\theta_1 + (1 - \lambda)\theta_2$, where $\lambda \in [0, 1]$ is a weighting parameter.
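
As an illustration only (not the exact procedure used in this work), a minimal sketch of such linear weight averaging over two state dictionaries might look as follows; the function and file names are ours.

```python
import torch

def linear_merge(state_dict_1, state_dict_2, lam=0.5):
    """Model-soup-style linear averaging of two fine-tunes of the same base model.

    Both state dicts must share identical keys and tensor shapes; `lam` is the
    weighting parameter lambda in [0, 1].
    """
    merged = {}
    for name, theta_1 in state_dict_1.items():
        theta_2 = state_dict_2[name]
        merged[name] = lam * theta_1 + (1.0 - lam) * theta_2
    return merged

# Usage sketch (file names are placeholders):
# sd_a = torch.load("model_a.pt"); sd_b = torch.load("model_b.pt")
# merged_sd = linear_merge(sd_a, sd_b, lam=0.5)
```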

Recently, new methods have been proposed to address merging language models specifically. Task Arithmetic5 involves building task vectors by subtracting pretrained from fine-tuned model weights, enabling manipulation through arithmetic operations to steer the merged model’s behaviour. TIES-Merging6 addresses information loss by incorporating three steps: resetting minimal parameter changes, resolving sign conflicts and merging only aligned parameters. Another recent work, DARE7, goes further by zeroing out small differences between the fine-tuned model and the original base model, while amplifying the remaining differences. Model merging has gained notable momentum in both the image generation and LLM communities. The implementation of mergekit1,2 has made these techniques widely accessible, allowing users to experiment with combining various methods. In addition, mergekit introduced Frankenmerging, which allows users to stack existing layers from multiple models to sequentially create a new model, though discovering effective recipes for this technique remains a challenge for the community.
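
For intuition, the sketch below illustrates the task-vector view together with DARE-style random drop-and-rescale; the drop probability and helper names are illustrative, and this is not the mergekit implementation.

```python
import torch

def task_vector(fine_tuned, base):
    """Task arithmetic: the task vector is the element-wise difference between
    a fine-tuned model and its base model."""
    return {k: fine_tuned[k] - base[k] for k in base}

def dare(delta, drop_prob=0.9):
    """DARE-style sparsification: randomly drop entries of a task vector and
    rescale the survivors by 1 / (1 - drop_prob) to preserve the expectation."""
    out = {}
    for name, d in delta.items():
        mask = (torch.rand_like(d) > drop_prob).to(d.dtype)
        out[name] = d * mask / (1.0 - drop_prob)
    return out

def apply_task_vectors(base, deltas, weights):
    """Add weighted (sparsified) task vectors back onto the base weights."""
    merged = {k: v.clone() for k, v in base.items()}
    for delta, w in zip(deltas, weights):
        for name, d in delta.items():
            merged[name] += w * d
    return merged
```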

All these discussions have been about merging fine tunes of the same base model. Merging completely independently trained models remains more challenging, as it requires resolving latent incompatibilities due to issues such as permutation invariance in neural network models8.

Please refer to the Supplementary Information for more detailed background and related work, including in-depth explanations of Task Arithmetic, TIES-Merging, DARE and other methods.

Evolutionary model merge

Our goal is to create a unified framework capable of automatically generating a merged model from a selection of foundation models, ensuring that the performance of this merged model surpasses that of any individual in the collection. Central to our approach is the application of evolutionary algorithms, which we use to refine the intricacies involved in model merging. To systematically address this challenge, we first dissect the merging process into two distinct, orthogonal configuration spaces, analysing their individual impacts. Building on this analysis, we then introduce a cohesive framework that seamlessly integrates these spaces. Figure 1 provides a schematic representation of our approach.

Fig. 1: Overview of the evolutionary model merge.

Our approach encompasses (1) evolving the weights for mixing parameters at each layer in the PS, (2) evolving layer permutations in the DFS and (3) an integrated strategy that combines both methods for merging in both PS and DFS. Notice that merging in the PS is not only the simple copying and stitching of the layers’ parameters but also the mixing of the weights. This merging is akin to blending colours as illustrated here (for example, red and blue become purple). Here, Q1 and Q2 refer to two math questions (originally in Japanese), while A1 and A2 are the respective answers generated by each model. Note that we translated the questions to English for the reader, as the models operate on Japanese text.


Merging in the PS

Model merging in the PS aims to integrate the weights of multiple foundational models into a unified entity with the same neural network architecture, yet outperforming the individual models. While various strategies for combining model parameters exist4,9, our approach leverages task vector analysis to understand each model’s strengths, based on the specific tasks they are optimized for or excel in5. Specifically, we enhance TIES-Merging with DARE6,7, allowing more granular, layer-wise merging (in this study, by ‘layer’ we mean the input/output embedding layers or a transformer block). We establish merging configuration parameters for sparsification and weight mixing at each layer, including the input and output embeddings. These configurations are then optimized using an evolutionary algorithm, such as CMA-ES10, for selected tasks, guided by critical task-specific metrics (for example, accuracy for MGSM or the ROUGE score for VQA).
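
The sketch below shows one way such a layer-wise configuration could be encoded as a flat vector for the evolutionary search; the dimensions and the helper functions (`merge_fn`, `eval_fn`) are illustrative rather than the exact pipeline used here.

```python
import numpy as np

NUM_LAYERS = 34        # illustrative: 32 transformer blocks + input/output embeddings
NUM_MODELS = 3         # source models being merged
PARAMS_PER_LAYER = 2   # for example, a DARE density and a mixing weight per model

def decode(flat_params):
    """Reshape the flat vector searched by the evolutionary algorithm into a
    per-layer, per-model merging configuration: cfg[layer, model] = (density, weight)."""
    return np.asarray(flat_params).reshape(NUM_LAYERS, NUM_MODELS, PARAMS_PER_LAYER)

def fitness(flat_params, merge_fn, eval_fn):
    """Build a merged model from the decoded configuration and score it with a
    task-specific metric (for example, accuracy on held-out Japanese GSM8k
    translations). `merge_fn` and `eval_fn` are supplied by the surrounding pipeline."""
    merged_model = merge_fn(decode(flat_params))
    return eval_fn(merged_model)
```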

Merging in the DFS

Recent analysis and discoveries imply that knowledge is stored in a distributed manner in language models11,12,13, suggesting simple yet novel model merging possibilities in the DFS. Unlike merging in the PS, model merging in the DFS keeps the original weights of each layer intact. Instead, it optimizes the inference path that tokens follow as they traverse the neural network. For example, after the ith layer in model A, a token may be directed to the jth layer in model B.

In our initial effort in this domain, we limit ourselves to serial connections and non-adaptive configurations, deferring the investigation of more flexible model merging to future work. Concretely, with a collection of N models and a budget T, our method searches for a sequence of layer indices $L_{i,j}^{(t)}$ that delineates the path all the tokens should follow for a specific task. Here, $L_{i,j}$ denotes the jth layer in the ith model, with $t \in [1, T]$ marking the step in the inference path.

One can quickly imagine how large the search space is. Assuming the total number of layers across all models is M, the size of the search space is $(M + 1)^T$; here, the extra one accounts for the inclusion of a pass-through layer. Even with a modest setting of M = 64 (for example, two models of 32 layers each) and T = 60, this translates to an astronomically large search space, a challenge even for a capable evolutionary search algorithm. Luckily, our preliminary studies indicated that certain layer arrangements, particularly repetitive or permuted sequences from earlier in the model, can adversely affect performance. Based on this, we modify our settings to include an indicator array $\mathcal{I}$ of size $T = M \times r$ in the evolutionary search space; here, r is the number of repetitions.
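
For concreteness, the unconstrained space above contains $(M+1)^{T} = 65^{60} \approx 6 \times 10^{108}$ candidate paths for M = 64 and T = 60; the indicator-array formulation described next reduces the search to binary inclusion decisions.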

Conceptually, we lay out all the layers in sequential order (that is, all layers in the ith model followed by those in the (i + 1)th model) and repeat them r times; the indicator array then manages the inclusion or exclusion of layers. If $\mathcal{I}_i > 0$, we include the layer corresponding to index i in the slots of the merged model; otherwise, we exclude it. Consequently, our search space is reduced to $2^T$, which is still large but tractable for evolutionary search.
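
A minimal sketch of how the evolved indicator array could be decoded into a concrete inference path is shown below; the function name and argument layout are our own.

```python
def build_inference_path(indicator, layers, num_repeats):
    """Construct the serial inference path from the evolved indicator array.

    `layers` lists all M layers laid out sequentially (all layers of model 1,
    then model 2, and so on); this layout is conceptually repeated `num_repeats`
    times and slot i is kept whenever indicator[i] > 0. With M = 64 and r = 3,
    the indicator has T = 192 entries, that is, a 2^192 (~6e57) search space.
    """
    assert len(indicator) == len(layers) * num_repeats
    return [layers[i % len(layers)] for i, keep in enumerate(indicator) if keep > 0]
```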

In our search, we only optimize the data inference path inside the merged model and keep the parameters in the models intact. In this setting, a layer may face an input whose distribution is different from what it is used to (from its original model), leading to unexpected outputs. For example, our preliminary studies14 show that swapping a pair of neighbouring layers in a language model makes its performance drop. Although more theoretical studies are needed to model the distribution shift, empirically, we find that appropriately scaling an input that wishes to go from layer i to layer j by $W_{ij}$ helps alleviate the problem. Here, $W \in \mathbb{R}^{M \times M}$ is a matrix that is also optimized by the evolutionary search together with the indicator array $\mathcal{I}$.

The size of W grows quadratically with M, which becomes problematic for scenarios involving a large number of layers. An alternative approach that contains the size of the search space is to parameterize W with a neural network15,16: we can instead evolve a feed-forward network that outputs the scaling weights conditioned on the layer and step indices, $W_{ij} = \pi_\theta(i, j, t)$, where θ denotes the parameters to be evolved, whose number does not change as M grows.
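
A minimal sketch of such a parameterization is shown below; the network width, feature encoding and initialization around 1 are our own choices rather than those used in this study.

```python
import torch
import torch.nn as nn

class ScalingNet(nn.Module):
    """Feed-forward network pi_theta(i, j, t) that outputs the scaling factor
    W_ij applied to an activation passing from layer i to layer j at path step t.
    The number of evolved parameters is independent of the total layer count M."""

    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, i, j, t):
        x = torch.tensor([[float(i), float(j), float(t)]])
        return 1.0 + self.net(x)   # centred at 1 so the unscaled path is easily recovered

# The flattened parameters of ScalingNet would be evolved jointly with the
# indicator array by the same evolutionary search.
```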

Merging in both spaces

Model merging in the PS and in the DFS are orthogonal approaches; however, it is straightforward to combine these disentangled methods and further boost the performance of a merged model. As we show in the rightmost illustration in Fig. 1 and in the ‘Evolving Japanese math LLM’ section in the Results, it is possible to first apply PS merging to a collection of models and then put back this merged model in the collection and apply DFS merging from this enlarged collection.

This can be extremely helpful when one considers model merging with multiple objectives, wherein PS merging can be applied first to produce several merged models, each targeting one of the multiple objectives of interest, and then DFS merging is applied with multiobjective genetic algorithms, such as NSGA-II17, to further improve the final model’s performance on the relevant metrics.

Results

Most merged models in the community optimize for a narrow set of tasks defined in the Open LLM Leaderboard3. Our motivation is for an evolutionary search to discover novel ways to merge different models from vastly different domains (for example, non-English language and math or non-English language and vision), for which it might be difficult for human experts to discover effective merging solutions themselves. Furthermore, effectively merging models from very different domains can lead to models of wider real-world applicability and enable us to develop models beyond the large population of models that are optimized for the narrow range of tasks defined by a leaderboard.

We demonstrate our evolutionary model merge approach described in the ‘Evolutionary model merge’ section by evolving a Japanese LLM capable of math reasoning and a Japanese VLM proficient in handling culturally specific content. Specifically, in the ‘Evolving Japanese math LLM’ section, we apply evolution to merge a Japanese LLM with an English math LLM to build a Japanese math LLM, and in the ‘Evolving Japanese VLM’ section, we apply evolution to merge a Japanese LLM with an English VLM to create a Japanese VLM.

Evolving Japanese math LLM

Experimental setup

To develop an LLM capable of solving math problems in Japanese, we conducted evolutionary model merging experiments using three source models based on Mistral-7B-v0.1: the Japanese LLM shisa-gamma-7b-v1 and two English math-specialized models, WizardMath-7B-V1.1 and Abel-7B-002. We evaluated the models on the Japanese test set of MGSM18 (250 problems translated from GSM8k (ref. 19)) and used Japanese translations of the remaining 1,069 GSM8k test problems for optimization. The evaluation criteria required both correct numerical answers and reasoning steps written in Japanese, measured under the zero-shot setting (see the Methods for details).

Experimental results

Table 1 summarizes the performance of the LLMs on Japanese math and overall Japanese benchmark tasks. The MGSM-JA column reports the results from the MGSM test set, using the previously described metrics. The Japanese language model (model 1) demonstrates limited mathematical proficiency, while the math models (models 2 and 3), though mathematically adept, show insufficient command of the Japanese language. Consequently, all three models score poorly on MGSM-JA, with accuracies at or below 30.0.

Table 1 Performance comparison of the LLMs

In contrast, our merged models (models 4–6) manifest a substantial elevation in performance. Notably, the model merged in the PS (model 4) achieves an impressive score of 52.0, highlighting the remarkable potential of combining models with distinct areas of expertise. The DFS-merged model (model 5) also shows a performance enhancement, with an over 6% increase in accuracy compared with the source models. While the leap in performance is not as pronounced as with PS merging, it still proves to be a valid and orthogonal approach. Finally, our hybrid model (model 6), which integrates both merging strategies, shows further enhancements on the task. The order of the source models in the indicator array $\mathcal{I}$ affects the performance of the DFS merging method. We conducted experiments with all possible combinations and report the best scores in Table 1, deferring detailed analysis to the Supplementary Information.

Figure 2 gives an overview of the five models’ ‘answer sheet’ on the math problems. Our merged models retain the foundational knowledge in the source models, as evidenced by the similar score patterns on problems 1–15. Moreover, they exhibit emergent capabilities, successfully tackling problems that stumped the source models (for example, problems 20–30). Evidently, by effectively integrating a Japanese LLM and mathematical models, we have succeeded in producing models that are proficient in both Japanese language understanding and mathematical problem-solving.

Fig. 2: Performance overview.

The success of various models on the MGSM-JA task, with each of the 250 test problems represented along the x axis by problem identification (ID). The correct answers are indicated by coloured markers at the corresponding positions.


Furthermore, Extended Data Table 1 presents the results of evaluating the general Japanese language ability using the Japanese Language Model Evaluation Harness (JP-LMEH) benchmark suite. This benchmark suite consists of nine tasks, and the average score across these tasks is widely used as an indicator of overall Japanese language proficiency. Our models achieve remarkably high scores of 70.5 and 66.2, surpassing the source models, all existing models with fewer than 70B parameters and even the previous state-of-the-art 70B parameter Japanese LLM (specifically, Japanese StableLM 70B), despite having only 7B–10B parameters.

Compared with the source Japanese model (shisa-gamma-7b-v1), there is an improvement in MGSM, but there is no clear trend for the other benchmarks; some benchmarks improve, while others worsen. It should be noted that the MGSM scores here do not match those in Table 1, due to the differences in evaluation protocols (few shot, prompting and so on).

Moreover, in the Supplementary Information, we showcase intriguing examples demonstrating the utility of our models merged using evolution. The merged models correctly answered questions that require both knowledge about Japanese culture and math ability. In contrast, even if such Japanese questions were translated into English and answered in English, English math models would probably fail to provide the correct answers, as they may not be aware of Japanese culture-specific context in the questions.

Analysis

In our exploration of model merging in the PS, we experimented with diverse configurations, such as varying the assignment of merging parameters across different layer groups. However, due to a constrained dataset, we did not witness notable improvements in performance correlating with increased configuration complexity. Consequently, we focused our reporting on a PS-merged model (model 4 in Table 1) that adopts the simplest setting: considering each source model as a singular layer and allocating two DARE-TIES-associated parameters to each for evolutionary merging. Figure 3 illustrates the evolved parameter configuration after PS merging.

Fig. 3: Evolved configurations for PS merging.

Although the weights are similar across the three source models, the pronounced density from the Japanese LLM underscores its pivotal role.


The CMA-ES optimization results reveal that all three models are important, as suggested by the uniformity of the optimized weighting values. It is noteworthy that the sum of the weights exceeds 1 and approaches 2. This suggests that a combination method that amplifies the contributions of the models, rather than a simple interpolation, proved to be more effective.

The dominant density from the Japanese LLM suggests its critical contribution to solving the task. We conjecture that this may also be partially attributed to the more extensive fine-tuning that the Japanese LM received from the Mistral base model. Japanese LMs based on English models, such as shisa-gamma-7b-v1, are typically created through a two-step process: continued pretraining and instruction fine tuning. The continued pretraining phase involves learning from a substantially larger dataset than standard fine tuning. For instance, shisa-gamma-7b-v1 is based on a Japanese base model that underwent continued pretraining on 100B tokens of Japanese text. The resulting differences in weights between shisa-gamma-7b-v1 and the original Mistral-7B-v0.1 probably encapsulate more information than standard fine tuning, making them more challenging to sparsify. In line with the discussion in the ‘When can DARE be used?’ section of ref. 7, the sparsification of DARE tends to degrade performance when applied to such extensively fine-tuned models. Our evolutionary search has seemingly managed to address this issue by increasing the density for the Japanese LM.

The parameter settings derived from our PS merging experiments align well with the outcomes of our DFS merging efforts. By incorporating the PS-merged model into our pool of source models and applying DFS merging across all potential pairings, we observed optimal performance with the combination of the PS-merged model and the Japanese language model (Table 1, model 6). This finding echoes the notable influence of the Japanese language model, as indicated by its prominent presence in Fig. 3, and reaffirms the substantial promise of evolutionary model merging.

Figure 4 displays the evolution of the inference path, where our approach consistently recognized the value of the initial steps, incorporating every layer from the first model (our PS-merged model), except for the last decoding layer and the embedding layer. As the process advanced, the method refined the selection to a smaller, more effective set of layers and strategically alternated between layers from both contributing models. Notably, the scaling parameters $W_{ij}$ emerged as crucial elements, and our ablation studies revealed that eliminating them in the evolved model (for example, by setting $W_{ij} = 1$) led to a performance decline exceeding 20%, highlighting their importance in the model’s efficacy.

Fig. 4: Evolved configurations for DFS merging of models A and B.

The three figures depict the evolution of the inference path on the MGSM-JA task. The y axis represents the layer index $l \in [1, M]$, and the x axis corresponds to the path index $t \in [1, T]$. The blue markers indicate path steps utilizing layers from model A, and the red markers denote those from B. Marker colour intensity reflects the magnitude of the scaling factor $W_{ij}$. The evolutionary search result includes most layers in A at an early stage and then alternates between layers from both models (result from our 10B model (PS + DFS)).


Evolving Japanese VLM

Multimodality extension

We now extend our method to multimodal models and evolve a Japanese VLM that is aware of culturally specific content. VLMs have recently shown remarkable progress by applying the powerful instruction-following capabilities of pretrained LLMs. The architecture of a VLM generally consists of three components: (1) a vision encoder to extract image features, (2) an LLM to generate text (for the purpose of describing an image) and (3) a projection network to map image features into the LLM’s embedding space20,21,22,23,24. Crucially, the LLM component is initialized with powerful pretrained LLMs for their text generation capabilities. During training, the projection network and optionally the LLM are trained on various vision–language datasets, while the vision encoder is fixed.
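
The sketch below summarizes this composition and how it interacts with merging: only the LLM component is swapped for the evolutionarily merged LLM, while the vision encoder and projector are retained. The class and argument names, and the Hugging Face-style `inputs_embeds` interface, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VLMSketch(nn.Module):
    """Schematic LLaVA-style VLM: a frozen vision encoder, a projection network
    mapping image features into the LLM embedding space, and an LLM that
    generates text. Only the LLM component is replaced by the merged LLM."""

    def __init__(self, vision_encoder, projector, merged_llm):
        super().__init__()
        self.vision_encoder = vision_encoder   # kept frozen
        self.projector = projector             # kept from the source VLM
        self.llm = merged_llm                  # evolutionary merge of the LLM components

    def forward(self, image, text_embeds):
        image_feats = self.vision_encoder(image)
        visual_tokens = self.projector(image_feats)   # visual 'soft prompts' in LLM space
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```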

Experimental setup

In our VLM experiments, we merge the Japanese LLM shisa-gamma-7b-v1 with the English VLM LLaVA-1.6-Mistral-7B, both built on Mistral-7B-v0.1. For evaluation, we created two new Japanese VLM benchmarks: JA-VG-VQA-500, a 500-sample test set from the Japanese Visual Genome VQA dataset, and JA-VLM-Bench-In-the-Wild, a curated set of 42 images with 50 culturally specific questions. We compare our model against two baseline models, LLaVA-1.6-Mistral-7B and Japanese Stable VLM, using ROUGE-L with Japanese language detection for evaluation (see the Methods for details).

Experimental results

Table 2 compares the performance of our VLM with the baselines. Please note that the Japanese Stable VLM cannot be evaluated on JA-VG-VQA-500 because it was trained on this dataset.

Table 2 Performance comparison of the VLMs

Our merged VLMs’ enhanced performance on the JA-VG-VQA-500 benchmark indicates their proficiency in Japanese, highlighting the successful integration of the source Japanese LLM with the LLM component of the original VLM through evolutionary merging. Consistent with the findings in the previous discussions, simple merging without evolutionary search does not give performance as strong as ours (see the last three rows in Table 2). Furthermore, our models’ superior results on JA-VLM-Bench-In-the-Wild compared with both baselines exhibit their adeptness at navigating culturally specific content.

Besides the quantitative results in Table 2, we qualitatively compare our VLM with the baseline models in the Supplementary Information. Our evolved model is able to handle Japanese culture-specific content remarkably well, generally producing more detailed responses with correct information.

Discussion

In this report, we propose a general method that uses evolutionary techniques to efficiently discover the best ways to combine different models from the vast ocean of different open-source models with diverse capabilities. By working with the vast collective intelligence of existing open models, our method is able to automatically create new foundation models with desired capabilities specified by the user. We find that our approach is able to automatically discover novel ways to merge different models from vastly different domains (for example, non-English language and math or non-English language and vision), in non-trivial ways that might be difficult for human experts to discover themselves.

To test our approach, we apply our method to automatically create a Japanese LLM capable of math reasoning and a Japanese VLM aware of culturally specific content. Surprisingly, we find that both models achieve state-of-the-art results on several LLM and vision benchmarks, despite not being explicitly optimized for these benchmarks, attaining top performance on a wide array of other Japanese LLM benchmarks and even exceeding the performance of some previous state-of-the-art 70B parameter Japanese LLMs.

With these promising initial results, we believe we are just scratching the surface of unlocking the full capabilities of evolutionary model merging, and this is the inception of a long-term development of applying evolutionary principles to foundation model development.

Currently, we are already achieving promising results in applying evolutionary model merging to image diffusion models, enabling the creation of high performance cross-domain image generation models by merging existing building blocks in novel ways discovered by evolution.

The method currently requires the user to select a set of source models to use as ingredients for evolutionary search. We believe it is also possible to leverage evolution to search for candidate source models from a vast population of existing models as well. In addition to model selection, we are also exploring using evolution to produce swarms of diverse foundation models each with its own niche and behaviours. This holds the potential of enabling the emergence of a collective intelligence consisting of a swarm of models capable of self-improvement by continuously producing new complementary internal models of the world through interaction.

Related to our work is an experiment called Automerger25, released at around the same time as this work. This interesting experiment works by selecting two random models from the top 20 models on the Open LLM Leaderboard3 and randomly applying SLERP26 or DARE-TIES6,7 to create new models. Over time, some of these models will do well, or even better, on the benchmark tasks that define this leaderboard, becoming part of the leaderboard. We predict this approach will lead to combinations of merged models that overfit to the benchmark tasks defined on the leaderboard. The author acknowledged that the idea behind this project was less about creating better models and more about gathering more metrics to help derive a more principled approach to model merging.

Our work takes an orthogonal approach of optimizing for tasks outside of the domain specified by the original leaderboard3, rather than being confined by it. As we have shown, surprisingly, stepping away from optimizing for a particular benchmark occasionally results in even greater generalization to numerous other benchmark tasks that we had not intended to optimize for, and such emergent generalization might be the key to unlocking the next great advancements in AI.

The ability to evolve new models with new emergent capabilities from a large variety of existing, diverse models with various capabilities has important implications. With the rising costs and resource requirements of training foundation models, large institutions or governments may, by leveraging the rich variety of foundation models in the open-source ecosystem, consider the cheaper evolutionary approach for developing proof-of-concept prototype models quickly, before committing substantial capital or tapping into national resources to develop entirely custom models from scratch, if that is even needed at all.

Further applications and impact

After the release of the preprint version of this study, researchers have explored evolutionary model merging in different domains, highlighting the method’s versatility and effectiveness. A notable example is EvoSDXL27, which applied evolutionary model merging to diffusion image generation models. This demonstrates that our method works well not just for LLMs and VLMs but for other types of model as well. Moreover, what makes EvoSDXL particularly interesting is its success in merging SDXL-Lightning28 with other standard SDXL fine tunes. SDXL-Lightning is a specialized variant of SDXL that uses an adversarial loss during training, enabling rapid image generation in just a few steps, compared with the 50 or 100 steps typically required by standard diffusion models. The evolutionary model merging technique effectively combined this unique model with conventional SDXL fine tunes, despite the different protocols used in their development. This success illustrates that our method is capable of integrating models created through varying protocols, combining their strengths to create more robust and powerful models. In addition, other unique models such as EvoVLM-JP-v2 (ref. 29) and EvoUkiyoe30 have also been developed using evolutionary model merging, further demonstrating the method’s potential and adaptability. Moreover, after the publication of the preprint, evolutionary model merging was implemented in two widely used open-source software packages, MergeKit1 and Optuna Hub31. It is now widely accessible, is being used in practice and its further possibilities are being explored.

Limitations

We acknowledge that, although our evolutionary model merging effectively integrates diverse expertise from the source models, it also inherits their limitations. For instance, we encountered instances where the merged models produced responses that lacked logical coherence. In addition, this study does not encompass instruction fine tuning or alignment, raising the potential for the models to yield outputs that may be factually flawed.

Ethical and societal impact

Evolutionary model merging offers substantial positive societal impacts by enabling the creation of small yet highly capable models at lower costs. This approach democratizes access to advanced artificial intelligence (AI) capabilities, potentially reducing the environmental footprint of AI development and deployment. By efficiently combining existing models, it can lead to more accessible and versatile AI solutions, particularly benefitting regions and languages with limited resources. However, as with other model development techniques, this approach may present certain considerations. The combination of diverse models could potentially lead to unexpected behaviours or biases, and the complexity of merged models might affect their interpretability. While these challenges are common in AI development, they underscore the importance of continued research and evaluation. It is worth noting that the models and techniques presented in this work serve primarily as a proof of concept. For applications in mission-critical scenarios or models intended for wide public use, further verification and refinement of the methodology may be necessary. This ongoing improvement process is crucial for realizing the full potential of evolutionary model merging while ensuring responsible and ethical deployment.

Methods

Setup for LLM experiments

Source models

To develop a model capable of solving math problems in Japanese, we apply evolutionary model merge on a set of source models containing a Japanese LLM and math LLMs: shisa-gamma-7b-v1 (ref. 32) (Japanese LLM), WizardMath-7B-V1.1 (ref. 33) and Abel-7B-002 (ref. 34). All these models are fine-tuned from Mistral-7B-v0.1 (ref. 35).

Dataset

For testing, we used the MGSM dataset18, a multilingual translation of a subset of the GSM8k dataset19. The Japanese test set of MGSM, consisting of 250 samples, was used for the final evaluation. Specifically, MGSM contains translations of the first 250 samples (IDs 0-249) from the GSM8k test set. We used a different dataset for the evolutionary search to avoid overfitting the test set. Specifically, we translated the remaining 1,069 samples (out of 1,319 examples) of the GSM8k test set that were not included in the MGSM test set into Japanese. These correspond to samples with IDs 250-1318 in the original GSM8k test set, ensuring no overlap with the MGSM Japanese test set. One might think it more natural to translate the GSM8k training set; however, this approach did not work well in our preliminary efforts. Because open-source math models were trained on the GSM8k training set, we were unable to perform accurate evaluations.

Evaluation

We evaluated the ability to generate Japanese answers to Japanese math problems. Therefore, we considered an answer correct if it met the following criteria: (1) the concluding numerical value must be correct, and (2) the reasoning text should be written in Japanese.

We treated the last numerical value appearing in the output as the answer. We needed to adopt this heuristic because we are merging multiple models that were trained in different formats, which made it difficult to correct the output format. This method appeared to extract the answers correctly in almost all cases. In addition, to determine the language of the output, we utilized fasttext36,37. We used greedy sampling for generation and calculated the zero-shot pass@1 accuracy.
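
A minimal sketch of this scoring heuristic is given below, assuming the publicly distributed fastText language-identification model; the file name `lid.176.bin` and the helper name are ours.

```python
import re
import fasttext

lang_id = fasttext.load_model("lid.176.bin")   # pretrained language-identification model

def is_correct(output_text, reference_answer):
    """Zero-shot pass@1 criterion: the last numerical value in the output must match
    the reference answer, and the output must be detected as Japanese."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output_text.replace(",", ""))
    if not numbers or float(numbers[-1]) != float(reference_answer):
        return False
    labels, _ = lang_id.predict(output_text.replace("\n", " "))
    return labels[0] == "__label__ja"
```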

For the evaluation using JP-LMEH38, we utilized Stability AI Japan’s fork of lm-eval-harness39 and configured it according to their convention. This configuration is widely used and is consistent with the results in their report40 and on the Rinna leaderboards41, thus allowing direct comparison of scores with a large number of Japanese LLMs.

Optimization

For optimization in PS, we used the CMA-ES10 algorithm implemented in Optuna31 with default hyperparameters. Specifically, we set all initial parameter values to 0.5, sigma to 1/6 and the population size to $4 + \lfloor 3\ln(n_{\mathrm{params}})\rfloor$, where $n_{\mathrm{params}}$ is the number of parameters to optimize. The fitness value is defined as the accuracy for all 1,069 training samples. Please note that this set is disjoint from MGSM’s test set. The optimization was conducted for 1,000 trials, and the best trial with respect to the training accuracy was chosen as the final model. We decided to use TIES-Merging6 with DARE7 through preliminary experiments and optimized its parameters.
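
A minimal sketch of this setup is shown below, assuming a recent Optuna version in which CmaEsSampler exposes x0, sigma0 and popsize; `evaluate_merge` is a placeholder for the merge-and-evaluate pipeline, and the parameter count is illustrative.

```python
import math
import optuna

N_PARAMS = 6   # illustrative: two DARE-TIES parameters per source model

def objective(trial):
    params = [trial.suggest_float(f"p{i}", 0.0, 1.0) for i in range(N_PARAMS)]
    # Placeholder: merge with the candidate parameters and return accuracy
    # on the 1,069 translated training samples.
    return evaluate_merge(params)

sampler = optuna.samplers.CmaEsSampler(
    x0={f"p{i}": 0.5 for i in range(N_PARAMS)},      # all initial values set to 0.5
    sigma0=1.0 / 6.0,                                # sigma = 1/6
    popsize=4 + math.floor(3 * math.log(N_PARAMS)),  # 4 + floor(3 ln(n_params))
)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=1000)
print(study.best_params)
```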

In our DFS merging experiments, M = 64, r = 3 and, consequently, T = M × r = 192. We kept the last 200 examples in the training data as our validation set and optimized on the rest of the data with a batch size of 200. We report the performance of the snapshot that achieved the highest accuracy on the validation set; the test set is strictly isolated from the optimization process. We adopted the CMA-ES implementation in EvoJAX42, which optimized $\mathcal{I}$ and W for a total of 100 generations with a population size of 128, using the default hyperparameters. We limited our DFS merging to two models, A and B, to ensure that the final model remains modest in size and can be run on a single graphics processing unit (GPU), but in principle, the methodology can scale to merging multiple models. During the merging, model A’s tokenizer and input/output embeddings are utilized. Furthermore, to maintain compatibility with the embedding layers, we mandate that the initial and final transformer layers of model A define the start and the end of the inference path. We initialized the indicator array $\mathcal{I}$ so that all layers in model A are more likely to be included as initial hops in the inference path, to shorten the search time.

Setup for VLM experiments

Source models

The LLM component inside a VLM can be regarded as a standalone LLM, with the extra capability of understanding visual soft prompts. From this perspective, by fixing the vision encoder and the projection network and only focusing on the LLM component, it is straightforward to apply the methodologies detailed in the ‘Evolutionary model merge’ section to produce a new LLM with expanded capabilities.

In this experiment, we merge a Japanese LLM and the LLM component in a VLM in the PS. We select shisa-gamma-7b-v1 (ref. 32) as the Japanese LLM and LLaVA-1.6-Mistral-7B43 as the VLM. Both models are fine tunes of the Mistral-7B-v0.1 (ref. 35) base model.

Dataset

To the best of our knowledge, publicly accessible Japanese VLM datasets are scarce. In response, we created a new open Japanese VLM benchmark and assessed our VLM on a widely recognized Japanese VQA dataset. Our new benchmark dataset consists of

  • JA-VG-VQA-500: a 500-sample test set extracted from the Japanese Visual Genome VQA dataset44

  • JA-VLM-Bench-In-the-Wild: a Japanese version of LLaVA-Bench-In-the-Wild22. We compiled a rich collection of 42 images, accompanied by a total of 50 questions, featuring a variety of Japanese cultural elements and objects found in Japan. The QAs were crafted with the assistance of GPT-4V45 and underwent a human-in-the-loop filtering process to eliminate nonsensical outcomes. Compared with the JA-VG-VQA-500 dataset, our set poses more complex challenges, demanding more nuanced and detailed responses

We used another subset of the Japanese Visual Genome VQA dataset during the evolutionary search. This subset does not overlap with the examples in the JA-VG-VQA-500 dataset, avoiding leakage in the optimization process.

The images in the JA-VLM-Bench-In-the-Wild dataset, which predominantly represent Japanese content, were carefully selected by native Japanese speakers to minimize the risk of insensitive or biased representations. All images were sourced from Unsplash and are published under the Unsplash licence, which ensures that there are no ethical or legal issues with their use. Unsplash contributors agree to obtain necessary permissions from related individuals in their photos before uploading, which addresses potential concerns regarding the use of images containing recognizable humans. In addition, the captions were generated using GPT-4V and were meticulously reviewed by human annotators to ensure accuracy and cultural sensitivity. This process aims to create a dataset that is both ethically sourced and culturally appropriate.

Evaluation

We consider two baselines in our experiments: LLaVA-1.6-Mistral-7B43, one of our source models, and Japanese Stable VLM46, a Japanese VLM trained from scratch on Japanese datasets.

All models adopt the same generation configurations, with deterministic decoding. We compute ROUGE-L with a Japanese language detector to replace non-Japanese responses with empty texts, resulting in a score of zero for non-Japanese responses. To be consistent with our LLM experiments in the ‘Evolving Japanese math LLM’ section in the Results, we also used fasttext36,37 for this language detection task. However, we made an exception for cases where the ground-truth answer itself contains non-Japanese but commonly seen words in Japanese texts (for example, a widely recognized acronym such as ‘UFO’). In these instances, non-Japanese responses from models are not converted to empty texts.
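
A minimal sketch of this language-gated scoring is shown below, reusing the fastText detector from the LLM evaluation; `rouge_l_fn` stands in for whichever ROUGE-L implementation and Japanese tokenization the pipeline uses, and the exception list is illustrative.

```python
import fasttext

lang_id = fasttext.load_model("lid.176.bin")   # pretrained language-identification model
ALLOWED_NON_JA = {"UFO"}                       # illustrative allowlist of non-Japanese terms

def is_japanese(text):
    labels, _ = lang_id.predict(text.replace("\n", " "))
    return labels[0] == "__label__ja"

def gated_rouge_l(response, reference, rouge_l_fn):
    """Score ROUGE-L, replacing a non-Japanese response with an empty string
    (score zero) unless the reference itself contains allowed non-Japanese terms."""
    if not is_japanese(response) and not any(t in reference for t in ALLOWED_NON_JA):
        response = ""
    return rouge_l_fn(response, reference)
```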

Optimization

We use the identical settings as the earlier LLM merging experiments in the ‘Evolving Japanese math LLM’ section in the Results. Concretely, we use TIES-Merging with DARE for merging the source models in the PS. For merging in the DFS, we treat LLaVA-1.6-Mistral-7B as our model A and shisa-gamma-7b-v1 as model B. For PS + DFS, our PS-merged model is model A and shisa-gamma-7b-v1 is model B.

Evolving for licence-specific open-source models

In the main section, our EvoLLM-JP results were evolved using models found on Hugging Face. However, some of the models used, in particular WizardMath-7B-V1.1 (ref. 33), have been released under a non-commercial, research-only Microsoft licence, which is not truly open source. Therefore, our release of EvoLLM-JP is also under a non-commercial, research-only licence, to be consistent with the WizardMath-7B-V1.1 model.

As researchers who benefitted from the open-source community, we would like for models that we release to also be under an open-source licence. In the spirit of open source and to showcase the applicability of our method to tackle even challenging issues such as model licences, we ran a similar experiment where we incorporated only models that have been released under a true open-source licence, such as MIT or Apache 2.0, and produced a similar performing model called EvoLLM-JP-A, which we released under Apache 2.0.

Specifically, our EvoLLM-JP-A is a merge of shisa-gamma-7b-v1, Arithmo2-Mistral-7B and Abel-7B-002, all of which are under the MIT or Apache 2.0 licence. The MGSM-JA score measured using the protocol described in the ‘Evolving Japanese math LLM’ section in the Results is 52.4, and the JP-LMEH score is 69.0. We have included the results of this Apache 2.0-licensed model for comparison in Extended Data Table 1 as ‘Ours (PS-A)’.
