Bigger SLMs as the Pragmatic Solution—a Dip into Their Capabilities

AI is all the rage now, especially natural language processing in all its forms and shapes. Understandably, many companies are now busy implementing Large Language Models (LLMs) in their products. Some are trying to use AI as a co-pilot for internal processes. Others are exploring how they can leverage the power of AI inside their products and apps. And yet, the whole process is not as smooth as most expected it to be. Along the way, developers and product owners have run into several roadblocks.

What are these? First and foremost, there is a high risk of data leaks and security breaches when using external LLMs, clouds, and APIs. Whenever, for example, you put data into ChatGPT and receive an answer, all this data travels from your device to a server far, far away that may not be as protected as it should be. This server is not ours, and we have little control over whether the model learns from our data. Secondly, using an API such as OpenAI’s costs money. These costs, which seem pretty low initially, quickly add up as the quantity of data sent to and from language models grows.

How can we solve those two problems? In the previous article, we came up with one possible response worth exploring: Small Language Models (SLMs). Despite fewer model parameters and a smaller training corpus, they are a promising alternative to large-scale models. With SLMs, we process data locally and thus avoid the risk of AI data leaks, and we can still optimize performance and see some impressive emergent abilities. These models can be customized and fine-tuned to specific enterprise needs, which makes them efficient at natural language processing tasks.

Sure, they aren’t capable of image generation. However, they handle natural language processing tasks well, can perform some complex tasks no worse than larger models despite having fewer parameters, and are a valuable tool, as we will show again in this article.

In the previous article, we covered a lot of ground: from what GenAI is to SLMs (Large Language Models’ ‘smaller brothers’), their pros and cons, and their specific characteristics. We picked the few models we considered most appropriate for our aims, such as Phi-2, Gemma, and Mistral 7B, and put them to the test, evaluating how good they are in various business contexts and at completely different tasks.

Now is the time to dive even deeper.

Today, we’ll be focusing on bigger SLMs. As we already mentioned, SLMs are much less resource-intensive and consume much less computational power than LLMs. Despite their smaller size compared to their larger counterparts, such as GPT-3.5 and GPT-4, they still possess up to 8 billion parameters and can generate human-like responses. We install these models on in-house servers because of the hardware limitations of mobile devices, which cannot efficiently handle such large systems. As you remember, our primary goal in this series is to see whether it’s possible to run AI on a mobile device without compromising the data’s security and reliability. SLMs’ compact architecture, tailored efficiency, and cost efficiency can make them a better fit for content creation than very large models.

In this article, we continue our tests. We’ll explore these models’ capabilities in natural language processing more rigorously and compare them with larger models, highlighting both their advantages and the challenges associated with their use.

As we’ve already discovered, the primary argument for installing smaller language models on in-house servers is the enhancement of security and privacy. Processing data on internal company servers increases control over the data and allows for precise monitoring of its use. Moreover, hosting models on in-house servers ensures companies have full ownership of the technologies employed and allows for rapid adaptation of the model to evolving business and technological needs without relying on external providers.

But is it all roses and sunshine? Unfortunately, no. Employing models with up to 8 billion parameters on in-house servers has certain limitations. The main drawback is that these models cannot run directly on mobile devices, which eliminates the option of offline operation. Is it the only disadvantage they have? Not quite. Let’s explore together what they are capable of.

Ready, set, go! Without further ado, let’s see if bigger SLMs are up to the task. 

Scaling Up: Opting For Bigger SLMs

We already know what small language models are. But are they all the same? It would be bizarre if they were.

Let’s separate SLMs into two main groups: those with up to 4 billion parameters (explored in our previous article, which you can read here) and those considerably larger, with up to 8 billion parameters. What difference does the number of parameters make? Understanding model size is crucial, as it directly affects a model’s computational demands and capabilities.

Generally, the more parameters a model has, the greater its capabilities. Models with more parameters, such as Llama 3 8B (by Meta), Gemma 1.1 7B (by Google), OpenChat 3.5, and Mistral 7B v0.2, typically provide better responses due to their ability to process larger context.

This capability is essential when we need the model to perform well when doing tasks that require understanding and analyzing larger blocks of text. So, when the depth and complexity of context are fundamental, those larger SLMs are more suitable for implementation in our products and apps.

On the other hand, models with fewer parameters, the ones we tested in the previous article, might struggle to grasp complex queries and generate responses based on larger datasets. This limitation significantly hampers their usefulness, as our tests have shown.

Admittedly, models with a higher number of parameters do require more operational resources. However, rest assured that even though our smartphones, while powerful, may not be up to the task just yet, more robust personal computers or servers can efficiently handle these advanced models, allowing us to fully exploit their capabilities.

The models we used are listed below. You can learn more about them or check some technical specifics on their official websites:

  • Llama 3 8B (by Meta)
  • Gemma 1.1 7B (by Google)
  • OpenChat 3.5
  • Mistral 7B v0.2

Optimization and Testing of Prompts for Natural Language Processing in Smaller Language Models

To make the most out of any language model, developing an adequate prompt is crucial. Creating and testing prompts is even more critical when using SLMs, as there’s far less hope that the model will guess and intuit what output you want. Careful prompting is therefore what makes SLMs practical and valuable in real applications: it makes them do the job.

The primary purpose of testing prompts is to make sure that not only does the model generate grammatically correct responses, but it also provides you with contextually appropriate and valuable information, demonstrating language understanding. After a couple of iterations, it will be pretty easy to identify the model’s weak spots and correctly assess its ability to understand what is expected. You’ll see pretty quickly how well it responds to varied queries so that, eventually, you’ll be able to fine-tune it to get responses that meet specific business and technical requirements.

Let’s dissect what this optimization process of SLMs looks like. The steps you have to take to calibrate the efficiency of the model and the quality of the output may look as follows:

  1. Creating several differing prompts and subsequently feeding them to the model. Testing a prompt this way includes:
  • Message generation: Creating responses to various prompts to assess how the model handles diverse queries.
  • Message appraisal and comparison: By simply reading the results and analyzing the differences between them, we can easily spot the most effective prompt.
  2. Optimizing the chosen prompts through various techniques.

There are a host of various prompt optimization techniques, from simple ones, such as the chain of thought, to way more advanced ones, like ReAct prompting. Some of these techniques include iterative adjustments to prompt content. In others, we’re experimenting with different structures and keywords.

When applying techniques such as the chain of thought, we encourage language models to explain their reasoning. In this case, we are seeking an answer and require the model to explain the steps it took to arrive at that answer.
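As a rough illustration of the chain-of-thought idea, the sketch below builds two variants of the same summarization prompt: a plain one and one that asks the model to show its reasoning first. The wording of both prompts and the function name `build_prompt_variants` are assumptions made for this example, not the prompts we used in our tests.

```python
def build_prompt_variants(text: str) -> dict[str, str]:
    """Return a plain prompt and a chain-of-thought variant for the same task."""
    task = "Summarize the text below in at most three sentences."
    plain = f"{task}\n\nTEXT:\n{text}"
    # Chain of thought: ask the model to lay out its reasoning before answering.
    chain_of_thought = (
        f"{task}\n"
        "First list the key facts you found, step by step, "
        "then write the summary on a line starting with 'SUMMARY:'.\n\n"
        f"TEXT:\n{text}"
    )
    return {"plain": plain, "chain_of_thought": chain_of_thought}


variants = build_prompt_variants("The DMA lets developers pick payment providers.")
for name, prompt in variants.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Feeding both variants to the same model and reading the outputs side by side is usually enough to see whether the extra reasoning step helps a given small model.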

Finally, it’s essential to check how well prompts fare in the context of their future application so that we are 100% sure they are optimized enough.

Automating prompt testing

When we have to deal with fairly large data sets, automating prompt testing becomes essential. This process mainly includes two things: finding a sufficiently large database (or simply generating one from scratch) and comparing the performance of different models.

Finding or generating a large database helps us test prompts using extensive data sets.

Comparing model performance means testing small models against larger ones. For example, we usually compare summaries generated by small models with those created by larger ones, such as GPT-3.5.

We can also save the results in a JSON file with a specific structure thanks to a feature named “Function calling.” With it, we can efficiently extract data to evaluate later how effective a particular prompt is. As a result, we avoid having to add up values and calculate averages manually.
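To show why structured output pays off, here is a minimal sketch of the post-processing step only: it reads JSON records and averages the scores per prompt. The record layout (`prompt_id`, `score`) and the sample values are hypothetical; the actual schema depends on how the function call is defined.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical structured output, one record per evaluated summary.
raw_results = """
[
  {"prompt_id": "plain", "score": 6},
  {"prompt_id": "plain", "score": 7},
  {"prompt_id": "chain_of_thought", "score": 8},
  {"prompt_id": "chain_of_thought", "score": 9}
]
"""


def average_scores(results_json: str) -> dict[str, float]:
    """Group scores by prompt and return the mean score for each prompt."""
    scores = defaultdict(list)
    for record in json.loads(results_json):
        scores[record["prompt_id"]].append(record["score"])
    return {prompt_id: mean(values) for prompt_id, values in scores.items()}


averages = average_scores(raw_results)
print(averages)
```

Because every record follows the same structure, comparing prompts comes down to a dictionary lookup rather than manual bookkeeping.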

How we do prompt testing & optimization at Applandeo: practical application

At Applandeo, we rigorously examined various prompt testing tools. We quickly realized that, despite the wide range of such tools, most of them fell short of our expectations. In particular, the results were presented in such a way that non-technical people would certainly struggle to understand them.

Those poor results motivated us to develop our own tool that uses popular prompt testing techniques. The following are the stages we go through when testing prompts:

  1. Finding the right, relevant, and appropriate database, such as a collection of short notes or articles.
  2. Creating a test that would generate message summaries using our chosen local model. 
  3. Comparing these results with those obtained from a larger model, such as GPT-3.5.
  4. Developing a prompt that evaluates the generated summaries on a numerical scale, allowing us to compare these results.
  5. Running the prompt test on the most advanced model available, such as GPT-4o.
  6. Once the tests on a selected number of messages are completed, we generate a report as an HTML file, which significantly facilitates the analysis and evaluation of models.
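The stages above can be sketched in a few lines. This is not our internal tool, only an outline under stated assumptions: the local model, the reference model, and the judge are passed in as plain callables, and the stand-ins at the bottom are fakes that let the sketch run without any model. All function names here are hypothetical.

```python
from html import escape
from typing import Callable

Summarizer = Callable[[str], str]
Judge = Callable[[str, str, str], int]  # (text, local summary, reference summary) -> score


def run_prompt_test(texts, local_model: Summarizer, reference_model: Summarizer, judge: Judge):
    """Summarize each text with both models and let the judge score the local summary."""
    rows = []
    for text in texts:
        local_summary = local_model(text)
        reference_summary = reference_model(text)
        score = judge(text, local_summary, reference_summary)
        rows.append((local_summary, reference_summary, score))
    return rows


def to_html_report(rows) -> str:
    """Render the results as a simple HTML table."""
    body = "".join(
        f"<tr><td>{escape(local)}</td><td>{escape(ref)}</td><td>{score}</td></tr>"
        for local, ref, score in rows
    )
    header = "<tr><th>Local model</th><th>Reference model</th><th>Score</th></tr>"
    return f"<table>{header}{body}</table>"


# Stand-ins so the sketch runs offline; replace them with real model calls.
def fake_local(text: str) -> str:
    return text[:20]


def fake_reference(text: str) -> str:
    return text[:30]


def fake_judge(text: str, local: str, reference: str) -> int:
    return 10 * len(local) // len(reference)


rows = run_prompt_test(
    ["A short note about the Digital Markets Act."],
    fake_local, fake_reference, fake_judge,
)
report = to_html_report(rows)
print(report)
```

Keeping the models behind callables means the same loop works whether the summary comes from a local server or a hosted API.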

In summary, creating and testing a prompt is a comprehensive process that requires detailed analysis and iterative tests to maximize the performance and usefulness of smaller language models in various usage scenarios.

Text Summarization with Small Language Models: Business Case

Language models are beneficial when we need automatic text summarization done well. With each passing day, it becomes more apparent that extracting essential information from texts is AI’s forte, no matter the style of the text. It comes as no surprise, therefore, that this can be very useful in analyzing large amounts of data or when there’s a need to condense articles to their essential points.

Our team of developers at Applandeo conducted a series of advanced tests on a number of SLMs. We developed a prompt that we initially tested on tiny models. You can check the results in a previous article in this series.

To fine-tune the initial prompt, we used the CNN/DailyMail dataset, available on the Hugging Face platform. This dataset, widely used for article summarization, allowed us to refine the prompt so that it generates concise content accurately, which shows how much high-quality data matters for model performance.

We tested the prompt on four models:

  1. OpenChat 3.5: Although 30% of the summaries from this model were better than those generated by GPT-3.5, we encountered problems with random hallucinations, repetitions, and improper formatting of summaries.
  2. Mistral 7b v0.2: Similar to OpenChat, it generated summaries comparable to GPT-3.5, but without hallucinations and formatting issues.
  3. Gemma 1.1 7b: This one rarely produced summaries better than GPT-3.5. They didn’t contain hallucinations but had a few additional unwanted elements in the results nonetheless.
  4. Llama 3 8b: Only a few summaries generated were better than GPT-3.5. These often featured hallucinations and unwanted additions despite an optimized prompt.

Additionally, we tested our prompt on more extensive texts, trying to stay within the limits of each model’s context. We used articles from our blog exported to .txt files. In the case of these larger files, the differences between the models started to blur. All in all, unsurprisingly, GPT-3.5 still provided more key information in the summaries. Mistral was the only model that struggled with some texts.
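Whether a text is likely to fit into a model’s context window can be checked up front. The sketch below is a rough pre-check that assumes about four characters per token for English text, which is only a rule of thumb. The function name, the reserved answer budget, and the example window sizes are illustrative assumptions; for accurate counts, use the tokenizer that ships with the model you run.

```python
def fits_context(
    text: str,
    context_tokens: int,
    reserved_for_answer: int = 512,
    chars_per_token: float = 4.0,
) -> bool:
    """Estimate whether the text plus the expected answer fits the context window."""
    estimated_tokens = len(text) / chars_per_token  # heuristic, not a real tokenizer
    return estimated_tokens + reserved_for_answer <= context_tokens


article = "word " * 8000  # 40,000 characters, about 10,000 tokens under the heuristic
print(fits_context(article, context_tokens=8192))
print(fits_context(article, context_tokens=32768))
```

When the check fails, the usual options are to shorten the input, split it into chunks, or switch to a model with a longer context.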

The results of the text file summarization test

As a longer text, we used an article from our blog. We saved the text as a .txt file and fed it to our models, asking them for a summary of the text as well as a list of its main points.

GPT-3.5

SUMMARY: 

The Digital Markets Act (DMA) allows businesses to choose third-party payment providers, reducing costs and improving user experience. DMA also aims to increase fairness and competition on digital platforms, affecting app developers, businesses, and end users. It has been in place since November 2022, with full implementation just beginning in March 2024.

MAIN POINTS:

  • DMA allows businesses to choose alternative payment providers, reducing fees and increasing flexibility.
  • The law aims to increase fairness and competition on digital platforms.
  • Full implementation of DMA began in March 2024, with changes expected to take 6-12 months to fully affect app developers and end users.

OpenChat 3.5

SUMMARY: 

The Digital Markets Act (DMA) is transforming mobile monetization and payment options, allowing developers greater freedom in selecting alternative payment service providers and saving costs with third-party solutions. This law aims to reduce the dominance of tech giants by offering users more choice, resulting in increased fairness and competition on digital platforms.

MAIN POINTS: 

  • The DMA introduces significant changes that affect gatekeeper platforms’ payment processors sector, allowing developers to use third-party payment systems instead of the gatekeepers’ own.
  • App developers can now choose alternative payment service providers based on factors such as cost and customer preferences.
  • The DMA promotes competition and greater flexibility in managing payment ecosystems, empowering developers and businesses with alternative payment options.
  • Cost comparison shows that using third-party solutions like Stripe may result in lower transaction fees for app developers compared to Google Play Store and App Store’s high fees.
  • Alternative payment providers offer more invoice transparency and customization compared to gatekeepers like Google and Apple, leading to better invoicing management.

Llama 3 8b

SUMMARY:

The Digital Markets Act (DMA) changes mobile monetization and payment options by allowing app developers to select alternative payment service providers. This move aims to increase fairness and boost competition on digital platforms, as Google and Apple will now have to accept payments through third-party solutions. The DMA introduces significant changes from a payment processor sector perspective, enabling businesses and app developers more flexibility in choosing their payment service providers.

MAIN POINTS:

  • App developers can choose alternative payment service providers instead of relying on platform-specific payment systems.
  • The DMA aims to increase fairness and boost competition on digital platforms by allowing for third-party payments.
  • Google and Apple will have to accept payments through third-party solutions, giving app developers more flexibility in managing their payment ecosystems.

Mistral 7b v0.2

SUMMARY:

The Digital Markets Act (DMA) entered into force in March 2024, aiming to rein in tech giants and promote fairer competition in digital markets. Key implications include mandatory unbundling of payment services from gatekeepers’ core platform services, enabling businesses and app developers more flexibility in choosing their payment service providers. This change will bring down transaction fees for developers using alternative payment systems like Stripe, improving cost savings.

MAIN POINTS:

  • The DMA aims to promote competition and fairness in digital markets by requiring gatekeepers to accept third-party payment processors.
  • Businesses and app developers will benefit from more flexible payment options and potential cost savings with alternative providers like Stripe.
  • The unbundling of payment services is part of a wider EU Digital Strategy process, expected to positively impact mobile apps, their developers, and users.

Gemma 1.1 7b

Summary:

European legislation known  as Digital Markets Act (DMA) mandates app developers in major markets like EEA and US with an alternative payment option alongside existing platforms’ billing systems . This will result improved competition, cost savings ,and enhanced flexibility of monetization strategies for mobile apps

MAIN POINTS:

  • DMA compels gatekeepers to accept third party payments.
  • Reduced fees compared  to platform’s original charges (Google & Apple).
  • Developers can choose between platforms billing or alternative solutions and pay reduced service fee based on revenue from external payment options in EEA, AU , BR etc..

Our verdict? Based on the tests we conducted, we can offer the following judgment on the efficiency of medium-sized language models. All in all, even smaller models can effectively summarize texts close to the maximum context length they support. Generating summaries of more extensive texts locally does not require immense resources, after all. The differences between these models and GPT-3.5 are often minimal. Therefore, both individual users and businesses can make use of them in their work.

Spell check

Medium-sized language models offer a set of capabilities that can significantly enhance our text-related tasks. Whether it’s summarizing, rewriting, analyzing, or even correcting grammar, punctuation, or typos, these models handle it well. Need to spot and fix a language error? Do you require additional editorial, academic, or business support? Or, perhaps, you’re seeking an assistant for your daily communication? These models may be your answer.

One of the key benefits of medium-sized language models is their ability to automate the correction of grammatical and punctuation errors in written documents. This not only ensures accuracy but also significantly accelerates the editing process. As a result, these models have become indispensable tools for copywriters, editors, and anyone working with texts, including myself.

It’s 2024, and by now, almost all of us have dabbled in typing assistants one way or another. Many such AI-based tools are available on the market today (e.g., Grammarly, Bing Spell Check, LanguageTool, Sapling). Most of them are based on large models, and most do a good job by now. We can also go the most straightforward way and simply use ChatGPT.

The main question is: how good is an average model that can be run on our server compared to the tools we listed above?

To find the answer, we created a set of various texts with numerous grammatical errors and typos in English. For each text, we followed these steps:

  • Put the text into the medium model of choice to correct any existing grammar mistakes and typos.
  • Instruct it to count the number of corrections it made.
  • Re-check the corrected text using the GPT-4 model.

The table below shows the percentage of errors found by each model, using the total of errors identified by GPT-4 as the benchmark.  

The percentage of errors found by each model

Model        Llama 3 8b    Mistral 7b v0.2    OpenChat 3.5
Corrected    72.9%         82.1%              78%

* Percentage of errors found by each model, using the total of errors identified by GPT-4 as the benchmark.
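For readers who want to reproduce this kind of figure, here is one plausible way to compute it. We assume that the benchmark total is the number of corrections the local model made plus the number of errors GPT-4 still found afterwards; that reading, the function name, and the counts below are illustrative assumptions, not our raw test data.

```python
def share_of_errors_found(corrected_by_model: int, remaining_found_by_gpt4: int) -> float:
    """Percentage of all known errors that the local model corrected."""
    total = corrected_by_model + remaining_found_by_gpt4
    if total <= 0:
        raise ValueError("at least one error is needed to compute a percentage")
    return round(100 * corrected_by_model / total, 1)


# Illustrative counts only: 39 corrections, 11 further errors flagged by GPT-4.
print(share_of_errors_found(39, 11))
```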

At first glance, the results may seem a bit disappointing. After all, we expect a text improvement tool to perform impeccably.

The OpenAI model keeps finding errors here and there. Yet, after thoroughly analyzing the “corrected” errors manually, the results look different. It becomes clear that some of the grammatical and punctuation errors identified by GPT-4 are not actually errors. In these cases, instead of correcting a mistake, GPT-4 went further and enhanced an already acceptable text.

Additionally, we selectively checked the quality of the corrected text. To do this, we copied it into one of the popular text editors out there to count and highlight the errors for us. This confirmed that the text was correct and that GPT-4, in most cases, had merely enhanced it.

Thus, the most significant “problem” of medium-sized models after text analysis seems to be typos. See for example:

  • Text corrected by Open Chat: words of art
  • Text corrected by GPT-4: words of arc

Both ‘art’ and ‘arc’ are valid words, yet with different meanings. In the correction above, GPT-4’s output is the correct response. In this case, the advantage larger models enjoy over smaller ones is hard not to notice. Large models understand the context of the entire text much better, which helps them choose the appropriate word. Of course, not all unresolved issues can be explained this way. Some typos or grammar issues remained nonetheless. Obviously, there’s a limit to what the models we tested could handle.

However, the biggest downside of the tested models compared to, for example, GPT-4, is limited multi-language support. The above tests were conducted on English texts, and for this language, we recommend using medium-sized models without any hesitation.

Despite the advantages of large models over the ones we tested, as described earlier, medium-sized language models are undoubtedly valuable tools. Writers and other professionals can use them in any messaging app as an advanced autocorrect mechanism, one capable of going well beyond typos alone.

But what about the issue of language support? Well, with each passing day, the number of models with built-in support for less common languages keeps growing. This, in turn, allows us to use the appropriate model depending, for example, on a phone’s language settings. 
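Choosing a model by language setting can be as simple as a lookup with a fallback. In the sketch below, the model names are placeholders rather than real models, and the locale parsing is deliberately minimal; treat the whole mapping as a hypothetical configuration.

```python
# Hypothetical mapping from language code to a locally hosted model name.
MODELS_BY_LANGUAGE = {
    "en": "english-model-placeholder",
    "pl": "polish-model-placeholder",
}
DEFAULT_MODEL = MODELS_BY_LANGUAGE["en"]


def pick_model(locale: str) -> str:
    """Pick a model from a locale such as 'pl-PL' or 'de_DE', falling back to English."""
    language = locale.replace("_", "-").split("-")[0].lower()
    return MODELS_BY_LANGUAGE.get(language, DEFAULT_MODEL)


print(pick_model("pl-PL"))
print(pick_model("de_DE"))
```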

Grammarly is cool. Is it safe, though? That’s a tough one. Hosting models on-premises gives us a significant privacy advantage. Due to data privacy obligations, many companies prohibit using Grammarly in messaging apps or even on corporate laptops. So, looking for alternative solutions that prioritize data security becomes an issue of paramount importance.

We also tested the Gemma model. While it didn’t perform as well as the other models at correcting errors, it did well at text summarization. In other words, Gemma may struggle to process text with typos and errors. On the other hand, it knows how to generate sufficiently good new text that is linguistically correct and contextually relevant, to boot. These are the insights we can draw from our test data sets.

Summary

Small language models with up to 8 billion parameters (medium-sized models, as we called them) coped effectively with analyzing both short and long texts, showcasing their efficiency, customization potential, and role in democratizing AI. As we discovered, their primary limitations are the maximum context size and the lack of support for languages other than English. And yet, versions of these models with an extended context range already exist, such as Llama-3-8B-Instruct-Gradient-1048k.

Optimizing prompts for smaller models might take a bit more of your time, which is understandable given their more limited text comprehension abilities. However, with the right techniques and language models, it is possible to effectively process and analyze extensive texts. Keep in mind that the models we used require a dedicated server or a more powerful personal computer to perform at their best in various application scenarios. This potential for effective text analysis should give you hope for the future.

Large models can still deliver better results in some cases, but the benefits of models that we run on our own servers are as clear as day. They address the data security issues associated with externally hosted large models while still providing satisfactory outcomes. Maybe it’s time you experimented with those models yourself. Applandeo already did, and we are satisfied with the results.
