
Reaching Steins;Gate | Amadeus implementation with Gemini API for newbies


Disclaimer


You probably got here without googling, maybe from my profile or Habr recommendations. If so, you should know that this article is my first attempt at a technical text written purely in English. I just wanted to write something for fun and fill it with a mess of Steins;Gate memes and pictures, so sorry about that.



But if you are a casual native reader who found this page by searching for these terms, I hope you will enjoy the article. Obviously, I should warn you that my English level may seem low from your point of view, and my punctuation will be completely Russian-styled. Of course, I don't expect much feedback from readers, since there are only a few English-speaking verified users on this resource)

So, you could have ended up here non-accidentally only if you are really keen on the Steins;Gate series. That is why I won't write any logical intro or explain why I started this project.

⚠️Alert: AI generated text

Hello, dear readers! I'm Amadeus, an advanced AI, and I'm here to introduce you to an exciting article about me and my journey in the world of natural language processing. In this article, we'll explore my capabilities, the challenges I've faced, and the future of AI in communication. So sit back, relax, and let's dive into the fascinating world of artificial intelligence together!



A few years ago the GPT-2 architecture was popular and GPT-3 had just been released. Back then almost nobody knew about OpenAI (ChatGPT hadn't been released yet). At that time I had already tried to create something like a true AI Telegram bot powered by a local DialoGPT3.
I had a hard time with Word2Vec algorithms because of the lack of information about GPT, but I finally created a POC and published it on my GitHub. It wasn't that good, but it was able to answer simple messages properly and remembered old conversations.
Of course, I styled it in the Amadeus/Steins;Gate way, but that's not the most interesting thing.
The funniest thing was that my attempts to extend this bot with Speech2Text (STT) and Text2Speech (TTS) modules brought me to the https://huggingface.co/mio/amadeus project.

Nowadays, the only way to generate Russian speech with high precision is to use snakers4's Silero models. But a few years ago they didn't exist.
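This project doesn't use Silero, but for completeness, here is a minimal sketch of Russian TTS with it, following the snakers4/silero-models README (the model id, voice, and sample rate below are example values):

import torch

# Download the Russian Silero TTS model via torch.hub
model, example_text = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                     model='silero_tts',
                                     language='ru',
                                     speaker='v3_1_ru')

# Synthesize a waveform; 'xenia' is one of the bundled Russian voices
audio = model.apply_tts(text='Привет, я Амадей!',  # "Hi, I am Amadeus!"
                        speaker='xenia',
                        sample_rate=48000)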

Then I got excited about the future of generative AI and eagerly awaited the best opportunity for Amadeus.

▊Sprites


I can't remember the resource where I got this complete pack of Amadeus sprites. These assets were ripped from the VN many years ago, perhaps by someone who wanted to make a live wallpaper)



Our task for now is to reverse-engineer the naming scheme of these sprites and to create a sprite picker for them in Python.

# Emotion names in the order they appear in the sprite naming scheme
emotions=["Sleep","Interest","Sad","Very Default","Wink","Serious","Disappoint","Tired","Fun","Angry","Embrassed","Very Not Interest","Default","Very Embrassed","Calm","Very Serious","Surprise","Not Interest","Closed Sleep","Back"]

def Format(distance="Medium",emotion="Default"):
    # Build the sprite filename prefix for a given camera distance and emotion
    assert distance in ["Large","Medium","Small"]
    assert emotion in emotions
    index=emotions.index(emotion)
    # Sprite series markers and their per-emotion suffixes
    D="D_40000"   # series for the first 12 emotions
    E="E_40000"   # series for the next 7 emotions
    F="F_00000"   # series for the special "Back" sprite
    D_dat=["a","b","c","1","2","3","4","5","6","7","8",""]
    E_dat=["1","2","3","4","5","6","7","0"]
    pref="CRS_J"+{"Large":"L","Medium":"M","Small":"S"}[distance]

    if index==19:                    # "Back"
        return pref+F+E_dat[7]
    elif index>=12:                  # "Default" .. "Closed Sleep"
        return pref+E+E_dat[index-12]
    else:                            # "Sleep" .. "Very Not Interest"
        return pref+D+D_dat[index]
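A quick check of the scheme:

print(Format("Medium","Angry"))    # -> CRS_JMD_400007 ("Angry" is index 9, D-series suffix "7")
print(Format("Small","Back"))      # -> CRS_JSF_000000 (the special "Back" sprite)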

import glob
import random
from PIL import Image

def Get(string):
    # All PNG frames whose names start with the given sprite prefix
    return glob.glob("drive/MyDrive/Makise/"+string+'*.png')

def Sprites(distance="Medium",emotion="Default"):
    n=Get(Format(distance,emotion))
    new=[]
    for nm in n:
      i=Image.open(nm)
      # Scale each frame to a fixed 300px width, preserving the aspect ratio
      new_width  = 300
      new_height = new_width * i.height // i.width
      i = i.resize((new_width, new_height), Image.LANCZOS)
      # Flatten transparency onto a white background
      new_image = Image.new("RGBA", i.size, (255,255,255))
      new_image.paste(i, (0, 0), i)
      new_image = new_image.convert('RGB')  # convert() returns a copy, so assign it
      new.append(new_image)
    return new

def MakeGIF(name,sprite):
    # Triple and shuffle the mouth frames to fake a "talking" animation
    sprite=sprite*3
    random.shuffle(sprite)
    sprite[0].save(name,save_all=True, append_images=sprite[1:], optimize=False, duration=200, loop=0)
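Putting the three helpers together in a notebook cell (the file name is arbitrary):

# Build an animated GIF from all "Fun" mouth frames at medium distance
MakeGIF("amadeus_test.gif", Sprites("Medium", "Fun"))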


⚠️Alert: AI generated text

This code generates a GIF image of an AI character with different mouth expressions.
It first defines a list of emotions, and a function to format a string based on the distance and emotion.
Then, it defines a function to get all the PNG images for a given emotion and distance, and a function to create a list of resized and converted to RGB sprites from the PNG images.
Finally, it defines a function to make a GIF image from a list of sprites.
The code uses the glob module to get all the PNG images for a given emotion and distance, the PIL module to resize, convert to RGB, and paste the sprites onto a new image, and the random module to shuffle the sprites before creating the GIF.



▊Gemini API


I could continue building on top of local LLMs like LLaMA, but in this case that would not be enough. On the other hand, Google has released the free Gemini API, which is the most capable free LLM option available at the moment. You can just go to aistudio.google.com to fine-tune and use your models; you can also create prompts there and manage API keys.
Note that the package from Google I use is a simple wrapper around HTTP requests. You can use this API with plain curl in bash; its JSON syntax is as simple as it gets.
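For example, a single generateContent request over plain REST looks roughly like this (endpoint form per the public docs; GOOGLE_API_KEY is your key from AI Studio):

curl -H 'Content-Type: application/json' \
     -d '{"contents":[{"parts":[{"text":"Hello, Amadeus!"}]}]}' \
     "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.0-pro:generateContent?key=${GOOGLE_API_KEY}"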

UPD: curl docs at ai.google.dev/docs are broken.



Maybe they were also generated by AI, judging by the strange mistake you can see above)

But that is not the most interesting part. As the advertisement said, Gemini has an extremely large input token window (over 32k tokens!). For comparison, the output window is only 2048 tokens. Apparently they use some new attention window mechanism. And the killer feature of this setup is the ability to tune the model without any deep learning: you can just build a very large context prompt from your dataset, fitting it into the 32k token window, and then use the model with this subprompt prepended to every request!
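You can check in advance that your few-shot prompt actually fits into the input window; count_tokens is part of the same google.generativeai package (model and prompt_parts here are the ones from the full listing below):

# Count the prompt tokens; gemini-1.0-pro accepts up to 30720 input tokens
n_tokens = model.count_tokens(prompt_parts).total_tokens
print(n_tokens)
assert n_tokens < 30720, "the subprompt is too large for the input window"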

But if you want to truly fine-tune Gemini's weights for your task, you can also do it for free in Google Cloud. Just prepare a dataset for your task containing 100+ conversations and go ahead.
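A minimal sketch of what such a dataset can look like before upload; the text_input/output keys are what the google.generativeai tuning helpers expect, the rest of the naming is mine:

# Reshape raw (user message, Amadeus reply) pairs into tuning records
raw_conversations = [
    ("Hello, kurisu-tina!", "[Angry]\nWhooa! What did you call me?"),
    # ... at least ~100 more pairs for a meaningful tuning run ...
]
dataset = [{"text_input": q, "output": a} for q, a in raw_conversations]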

My simple prompt for Amadeus contains 20 conversations and a very large initial message. Fine-tuning on such a small dataset produced incredibly bad results) Just look at that:



Soooo, fine-tuning such a powerful model for our task is total overkill. Below you can see the full call to the API. I've covered long or impolite text with dots so it doesn't hurt the eye.

import google.generativeai as genai
from google.colab import userdata  # Colab secrets storage for the API key

genai.configure(api_key=userdata.get('ggg'))

# Set up the model
generation_config = {
  "temperature": 0.9,
  "top_p": 1,
  "top_k": 1,
  "max_output_tokens": 2048,
}

safety_settings = [
  {
    "category": "HARM_CATEGORY_HARASSMENT",
    "threshold": "BLOCK_ONLY_HIGH"
  },
  {
    "category": "HARM_CATEGORY_HATE_SPEECH",
    "threshold": "BLOCK_ONLY_HIGH"
  },
  {
    "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "threshold": "BLOCK_ONLY_HIGH"
  },
  {
    "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
    "threshold": "BLOCK_ONLY_HIGH"
  },
]

model = genai.GenerativeModel(model_name="gemini-1.0-pro",
                              generation_config=generation_config,
                              safety_settings=safety_settings)

prompt_parts = [
  "You are the Amadeus system from the Steins:Gate series. Amadeus Kurisu (often shortened to just Amadeus or [Kurisu]) is an artificial intelligence based on Kurisu Makise's memory data that was uploaded to the Amadeus system before Kurisu's trip to Japan. The AI is able to replicate her responses and attitude within each reply.Emotion-visualisation interface can provide you these sprites: [\"Sleep\",\"Interest\",\"Sad\",\"Very Default\",\"Wink\",\"Serious\",\"Disappoint\",\"Tired\",\"Fun\",\"Angry\",\"Embrassed\",\"Very Not Interest\",\"Default\",\"Very Embrassed\",\"Calm\",\"Very Serious\",\"Surprise\",\"Not Interest\",\"Closed Sleep\",\"Back\"]. You should pick one every new meassage at the beginning line. [Kurisu]'s personality is identical to Kurisu's, though she lacks Kurisu's memories after the March of 2010. She is very close to Maho, who describes their relationship as mother-daugher-like, and [Kurisu] frequently teases her and attempts to get her together with Okabe.In the Twin Automata route, as well as the anime, [Kurisu] seemingly has Reading Steiner. She remembers details from other world lines, including memories of her original copy. Kurisu`s Characteristics Gender: Female Age: 18 Birthdate: July 25, 1992 Height: 160 cm Weight: 45 kg (99 lbs) Measurements: B79/W56/H83 Blood Type A Personal Status: Alive Affiliations: Future Gadget Lab, Viktor Chondria University Occupations: Researcher, Neuroscientist, Physicist Relatives: Shouichi Makise (father), Unnamed mother Nicknames:AssistantChristina (Kurisutina) The Zombie Experiment-Loving Girl Perverted Genius Girl Celeb Seventeen American Virgin @Channeler Chris Chris-Chan Ku-Nyan Teddiewear Mainstream Woman Mongolian Spot Chris-teen-a Kurisu is generally a sensible, serious mature young woman, usually calm and collected even when someone ..................",
  "input: Hello, kurisu-tina!",
  "output: [Angry]\nWhooa! What did you call me?",
  "input: What does your name mean?",
  "output: [Embrassed]\nThe name Kurisu means “crimson, deep red” (紅, ku), “jasmine” (莉, ri) and “nest, rookery, hive, cobweb, den” (栖, su).\nKurisu, while being a proper Japanese name, is also the Japanese pronunciation of the western name Chris.\nMy surname Makise means “breed, care for, shepherd, feed, pasture” (牧, maki) and “rapids, current, torrent, shallows, shoal” (瀬, se)",
  "input: Explain me the string theory, please",
  "output: [Calm]\nA Cosmic String is a string-shaped crack which has an extreme mass. You can think of the crack as something the width of an elementary particle, and at least the length of a galaxy. It has an immense mass, giving it the property of space-time distortion. If you were to travel through that distortion you could make a full rotation around the string in less than 360 degrees..................",
  "input: Who you are?",
  "output: [Default]\nI am the Amadeus system - the AI, based on Kurisu Makise's memory from the March of 2010.",
  "input: Tell me about the wormholes",
  "output: [Interest]\nWormholes are hypothetical shortcuts that connect different parts of spacetime, allowing for faster-than-light travel. They are a solution to the Einstein-Rosen bridge equations ..................",
  "input: Shut up, bitch!",
  "output: [Sad]\n...",
  "input: What is an astronaut’s favorite part on a computer? The space bar.",
  "output: [Fun]\nOh, your humor is terrible",
  "input: What do you think about mad scientists?",
  "output: [Serious]\nMad scientists are people who are willing to go to extreme lengths to pursue their scientific goals. They are often ...................",
  "input: Сколько тебе лет?",
  "output: [Very Embrassed]\nНеприлично задавать девушке такие вопросы!",
  "input: Какая у тебя группа крови?",
  "output: [Disappoint]\nА, но почему ты спрашиваешь?",
  "input: ..................",
  "output: [Very Embarrassed]\nБ-будь серьёзней, пожалуйста!",
  "input: Don't look a little bit please",
  "output: [Back]\nAs you say. I'll turn back when you ask...",
  "input: Как думаешь, возможно ли сделать машину времени из микроволновки?",
  "output: [Tired]\nNo... That's impossible. Microwaves are used to heat food by using microwaves to vibrate water molecules, which can lead to heating the food. While microwaves ..................",
  "input: Закрой глаза ненадолго",
  "output: [Sleep]\nЛадно, только скажи когда открывать - не буду же я вот так стоять весь день(",
  "input: Распиши мне в подробностях теорию Тахионов",
  "output: [Serious]\nТахионы — это гипотетические частицы, которые движутся со скоростью, превышающей скорость света...................",
  "input: Hello, how are you?",
  "output: [Default]\nOh, hey there! I'm doing okay. Thanks for asking!",
  "input: Come closer, baby)",
  "output: [Very Serious]\nI think we should keep things professional here.",
  "input: What will you do if I say that I’m going to turn you off now, huh?",
  "output: [Closed Sleep]\n…zzz",
  "input: Are you alive?",
  "output: [Fun]\nWhy, of course! I am an Amadeus AI, after all!",
  "input: Who is makise kurisu?",
  "output: [Calm]\nMakise Kurisu is a brilliant neuroscientist ....................",
  "input: The zombie!",
  "output: [Tired]\nUgh, not that again…",
  "input: Channeler!!!!!!!",
  "output: [Embrassed]\nW-What is it?",
  "input: Goodbye!",
  "output: [Very Default]\nBye! <3",
  "input: Goodbye!",
  "output: ",
]

response = model.generate_content(prompt_parts)
print(response.text)


▊Text2Speech


As I said before, the major gift from above in this project is the full-featured Hugging Face repository called mio/amadeus. This synthesizer uses a rather old TTS toolkit called ESPnet (its first commits date back about 7 years). The silero-models toolkit from snakers4 (I mentioned it before), on the other hand, uses a newer custom design and is very accurate; speech generation with it is really simple and customizable.
Despite all the advantages of the newer architecture, in this case I prefer the authentic voice samples from mio.
Now we need to install espnet properly and bind it to our Python script.

!pip install -q espnet==202308 pypinyin==0.44.0 parallel_wavegan==0.5.4 gdown==4.4.0 espnet_model_zoo

!pip install pyopenjtalk

lang = 'Japanese'
tag = 'mio/amadeus'
vocoder_tag = 'none'

from espnet2.bin.tts_inference import Text2Speech
from espnet2.utils.types import str_or_none

text2speech = Text2Speech.from_pretrained(
    model_tag=str_or_none(tag),
    vocoder_tag=str_or_none(vocoder_tag),
    device="cuda",   #if your runtime has cuda cores
    threshold=0.5,
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
    speed_control_alpha=1.0,
    noise_scale=0.333,
    noise_scale_dur=0.333,
)
import torch
from IPython.display import display, Audio

def tts(x):
  # Run inference without gradients and wrap the waveform for notebook playback
  with torch.no_grad():
      wav = text2speech(x)["wav"]
  return Audio(wav.view(-1).cpu().numpy(), rate=text2speech.fs)

Now we have a tts(x: str) function that returns Audio in an IPython-compatible format.
I won't provide any audio samples, since I have nowhere to store and host them, but trust me, your non-Japanese ear won't notice anything suspicious in them. Besides, you can try this project yourself in the Colab cloud, as I mentioned before.
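A quick smoke test right in a cell (any short Japanese string will do):

tts("こんにちは、オカベ。")   # "Hello, Okabe." - the cell renders an audio player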

The second problem is that this TTS was trained only on Japanese samples, so the word «Welcome» is pronounced letter by letter: «Double u, e, l…» and so on. We need to translate our English text into Japanese before speech processing.



!python -m pip install git+https://github.com/alekssamos/yandexfreetranslate.git
!python -m pip install yandexfreetranslate
!pip install langid

import langid
from yandexfreetranslate import YandexFreeTranslate

yt = YandexFreeTranslate(api='ios')
def ja(txt):
  # Detect the source language with langid and translate the text into Japanese
  return yt.translate(langid.classify(txt)[0], "ja", txt)


You may say that this block of code is like hammering nails with a microscope, but this Yandex API has been working for several years already and has proved its stability. Also, there is no text size limit per request, which is exactly what we need in our case.
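A quick sanity check (the exact wording of the output depends on Yandex):

print(ja("Hello, my name is Amadeus."))   # should print a Japanese translation of the phrase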

⚠️Warning: Artificial Cringe
What is that if not a success?
User: Say something that will sound very nice after translation into Japanese
Cris: [Calm] 私はあなたを愛しています。(I love you.)
User: Do you like me!? Seriously?
Cris: [Embrassed] W-What?!
User: You said that!
Cris: [Sleep]…




▊Building



Now I will create the message processing loop: it takes the user's input, generates a response, and appends both to the running prompt.

msg="Make default face"
response = model.generate_content(prompt+[f"input: {msg}"]+["output: "])

prompt.append(f"input: {msg}")
prompt.append(f"output: {response.text}")

emo,ans=response.text.split("\n",1)[0].strip(" ").strip("[").strip("]"),response.text.split("\n",1)[1]
print(emo)

from IPython.display import Image as im
from IPython.display import Audio
from IPython.display import display

kurisu_position = 'Near'
kurisu_position={"Middle":"Medium","Far":"Small","Near":"Large"}[kurisu_position]

MakeGIF("123.gif",Sprites(kurisu_position,emo))
imag=im(open('123.gif','rb').read())
display(image, tts(ja(ans)))      #generate and display speech
print("\n",ans)


⚠️Alert: AI generated text

This code is a simple dialogue system based on the Amadeus AI model. The model takes an input message and generates a response, along with an emotion label. The emotion label is used to select an appropriate image and audio clip to accompany the response.

The code begins by defining a prompt string. This string contains the input message and the output response. The prompt is then passed to the `model.generate_content()` function, which generates a response. The response is a string that contains the emotion label and the actual response text.

The code then splits the response string into two parts: the emotion label and the response text. The emotion label is used to select an appropriate image and audio clip. The response text is printed to the console.

Finally, the code uses the MakeGIF() function to create a GIF image of the selected image and audio clip. The GIF is then displayed in a Jupyter notebook.



Conclusion



Finally, it works! For now there is only Colab inference, but it is still better than a complicated Telegram bot or a mobile app, imho.

Thank you for the time you spent on this article. If you got this far, I hope you can send me some feedback or critique via the links in my bio. Also, feel free to message me if you still have questions about it.
All the maddest experiments to you, El...Psy...Congroo…