Chunking methods
What is chunking?
Chunking is the process of dividing large text into smaller units (chunks) to make it easier to process for search, embedding, retrieval, or model input.
It is just like reading a long, complex text: we naturally break it into smaller parts, re-read certain paragraphs, slow down to understand details, and then return to the full text to grasp the overall meaning. If we don't know the topic well, we might miss or misunderstand something, so analyzing it piece by piece helps.
The same idea applies to a chunking engine. When we ask an AI or LLM to analyze a long document or answer questions about it, it first needs to split the text into chunks, analyze each section, store the important information in memory, and only then can it interact with us about the content.
The main chunking methods are:
Fixed-size Chunking
Splits text into equal-sized character or token blocks (chunk(0:1000), chunk(800:1800)). It may cut sentences or ideas in the middle, reducing semantic coherence. Use it for fast preprocessing where structure matters less (embeddings, RAG).
Sentence-based Chunking
Groups text by full sentences, typically 2–5 per chunk (“A. B. C.” → [A+B], [C]), as in articles, reports, and well-punctuated text. If the text has low-quality punctuation, chunking becomes less accurate.
Paragraph-based Chunking
Uses paragraph breaks (\n\n) as chunk boundaries (“Para1\n\nPara2” → [Para1], [Para2]). Best suited for documentation, books, blogs, and structured writing. On the other hand, if paragraphs are very long or very short, chunk sizes become unpredictable.
Recursive (Hierarchical) Chunking
Split by large boundaries first, then recursively split oversized chunks into smaller units (big paragraph → sentences → tokens). Best for long PDFs, legal text, academic papers, and messy formatting; however, it is more computationally expensive and more complex to implement.
Semantic Chunking
Groups text based on embedding similarity rather than formatting (sentences about “training,” even far apart, are combined into one chunk). Better for high-accuracy RAG, legal/medical content, and concept-heavy documents, but it is computationally heavy, slow, and requires embeddings for every sentence.
Hybrid Chunking
Combines structural splits with fixed-size windows and overlap (paragraph split → 800-token window with 200-token overlap). Useful for production RAG, multi-format document ingestion, and repo processing. Cons: more configuration is needed, and results may vary across document types.
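As a minimal sketch of the two simplest strategies above (pure Python; the chunk size and overlap are assumed illustrative values, not recommendations), fixed-size and hybrid-style sliding-window chunking can look like this:

```python
def fixed_size_chunks(text, size=1000):
    # Cut the text into equal character blocks; sentences may be split mid-way.
    return [text[i:i + size] for i in range(0, len(text), size)]

def sliding_window_chunks(text, size=800, overlap=200):
    # Hybrid-style: fixed windows with overlap, so an idea cut at one
    # boundary still appears whole in the neighbouring chunk.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "A" * 2500
print(len(fixed_size_chunks(text)))  # 3 chunks: 1000 + 1000 + 500
print(sliding_window_chunks("abcdefghij", size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

In a real pipeline the windows are usually measured in tokens rather than characters, but the splitting logic stays the same.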
Personalized image generation model
After some time enjoying image generation as a user, meaning generating only with prompts and using ControlNet directly, I understood that those methods are not enough and became familiar with new techniques for personalizing generative AI.
I used the diffusers repository and the Hugging Face tutorial, following its instructions to become familiar with the pipeline. The explanation is detailed except for the CUDA setup (which probably depends on the GPU type), so I added:
🐍 pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu12X
It is important to configure the accelerate config.
What is accelerate?
Accelerate is like a helper tool for training AI models. Imagine you're building something with friends, and Accelerate helps everyone work together smoothly, no matter what tools or machines they have. Accelerate checks what you have, like a computer, one GPU, many GPUs, or even special machines like TPUs, and figures out how to use them best. If you have multiple GPUs or computers, Accelerate helps them talk to each other and share the work evenly, like splitting a big puzzle into smaller pieces for everyone to solve together. It can make your training faster by using smart tricks, like doing some work in smaller, faster steps (like using lighter blocks to build a tower instead of heavy ones). You don’t have to learn how to make all the tools and machines talk. Accelerate takes care of the hard parts so you can focus on your model.
Accelerate config
The accelerate config is located in ~/.cache/huggingface/accelerate/default_config.yaml. It can be created manually, by answering the questionnaire that accelerate config opens, or automatically, in which case the accelerate library does all the work and sets the defaults it detects for your system.
In both cases, the config contains the following parameters:
compute_environment
the environment where the training is running: LOCAL_MACHINE or CLOUD
debug
enables or disables debug mode
distributed_type
the type of distributed training to use, e.g. NO, MULTI_CPU, MULTI_GPU, TPU
downcast_bf16
controls whether to downcast tensors to BF16 (bfloat16) precision for memory and compute efficiency.
enable_cpu_affinity
specifies whether to pin processes or threads to specific CPU cores for optimized performance
machine_rank
the rank of the current machine in a distributed setup. In distributed training with multiple machines, this would be a unique ID (e.g., 0, 1, etc.). Since distributed training is disabled, the rank is 0 (single machine).
main_training_function
specifies the entry point or function that starts the training process, typically refers to the main function in the training script.
num_machines
number of machines participating in the training
num_processes
number of processes to spawn for training. In distributed or multi-GPU training more processes are specified; for single-machine, single-GPU setups this is set to 1.
rdzv_backend
rendezvous backend used for distributed process coordination: "static" - manual setup with static configuration (e.g., machine IP addresses); "c10d" - PyTorch's default backend for distributed training; "etcd" - an external service like etcd for dynamic process discovery.
same_network
indicates whether all machines participating in distributed training are on the same network.
tpu_use_cluster
specifies whether a TPU cluster is being used
tpu_use_sudo
indicates whether sudo is needed to access TPU resources
use_cpu
specifies whether to use the CPU for training
*TPU stands for Tensor Processing Unit
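For reference, a single-machine, single-GPU default_config.yaml produced this way might look like the following (the values are illustrative for one setup, not universal defaults):

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```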
Textual Inversion is a training technique for personalizing image generation models with just a few example images of what you want it to learn. This technique works by learning and updating the text embeddings (the new embeddings are tied to a special word you must use in the prompt) to match the example images you provide.
To try how textual inversion works and see the training process and the model in use, I simply followed the instructions on Hugging Face, downloaded the recommended cat-toy dataset, and ran the training script.
TI training script parameters
🐍accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" \
  --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 \
  --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --push_to_hub \
  --output_dir="textual_inversion_cat"
Parameters understanding
accelerate launch textual_inversion.py launches the **textual_inversion.py** training script using the accelerate library
pretrained_model_name_or_path=$MODEL_NAME specifies the pre-trained model to use for TI. $MODEL_NAME can point to a Hugging Face model like "CompVis/stable-diffusion-v1-4". The model is the starting point, and you fine-tune it to learn new concepts.
train_data_dir=$DATA_DIR specifies the directory containing the training images.
learnable_property="object" defines the type of concept you want the model to learn. Currently "object" and "style" are supported, but custom properties can be added.
num_vectors the number of vectors used to learn the embedding; increasing this parameter helps the model learn better, but comes with increased training cost
placeholder_token="<cat-toy>" the new "placeholder" token the model learns to associate with the concept during training. Example: <cat-toy> becomes the identifier for your fine-tuned concept.
initializer_token="toy" a token from the original model vocabulary that is similar to your concept. Acts as a starting point for learning the new placeholder token. Example: "toy" initializes <cat-toy> because it's semantically close.
resolution=512 the resolution (height and width) of images for training, resized to 512x512 pixels.
**train_batch_size=1** number of images processed in a single batch during training. Small batches (like 1) are common for large models to avoid running out of memory.
**gradient_accumulation_steps=4** splits a batch across multiple steps if memory is limited. With a batch size of 1 and accumulation steps of 4, the model updates its weights after every 4 steps, simulating a batch size of 4.
max_train_steps=3000 the total number of training steps. Higher values allow the model to learn better but increase training time.
**checkpointing_steps** frequency of saving a checkpoint as the model trains; this is useful if training is interrupted for some reason - you can continue training from that checkpoint by adding resume_from_checkpoint to your training command
learning_rate=5.0e-04 the rate at which the model updates its weights during training. A smaller learning rate slows down training but ensures stability.
**scale_lr** automatically adjusts the learning rate based on the total number of GPUs and gradient accumulation steps; helps maintain stable training when scaling across devices.
**lr_scheduler="constant"** defines the learning rate schedule. "constant" keeps the learning rate fixed throughout training; other options include "cosine" and "linear", which decrease the learning rate over time.
Learning rate refers to the strength by which newly acquired information overrides old information.
**lr_warmup_steps=0** number of initial steps during which the learning rate gradually increases from 0 to the target value. 0 means no warmup; the learning rate starts immediately at the defined value.
Warm-up explanation
If your data set is highly differentiated, you can suffer from a sort of "early over-fitting". If your shuffled data happens to include a cluster of related, strongly-featured observations, your model's initial training can skew badly toward those features -- or worse, toward incidental features that aren't truly related to the topic at all.
Warm-up is a way to reduce the primacy effect of the early training examples. Without it, you may need to run a few extra epochs to get the convergence desired, as the model un-trains those early superstitions.
Many models afford this as a command-line option. The learning rate is increased linearly over the warm-up period. If the target learning rate is p and the warm-up period is n, then the first batch iteration uses 1*p/n for its learning rate; the second uses 2*p/n, and so on: iteration i uses i*p/n, until we hit the nominal rate at iteration n.
This means that the first iteration gets only 1/n of the primacy effect. This does a reasonable job of balancing that influence.
Note that the ramp-up is commonly on the order of one epoch -- but is occasionally longer for particularly skewed data, or shorter for more homogeneous distributions. You may want to adjust, depending on how functionally extreme your batches can become when the shuffling algorithm is applied to the training set.
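The linear ramp described above is easy to sketch: with target rate p and warm-up period n, iteration i uses i*p/n until the nominal rate is reached (the numbers below are illustrative):

```python
def warmup_lr(i, p=5.0e-4, n=100):
    # Linear warm-up: iteration i (1-based) uses i * p / n,
    # then stays at the nominal rate p once i reaches n.
    return min(i, n) * p / n

# A tiny schedule with target rate 1.0 and a 4-step warm-up:
print([warmup_lr(i, p=1.0, n=4) for i in range(1, 7)])
# [0.25, 0.5, 0.75, 1.0, 1.0, 1.0]
```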
push_to_hub uploads the fine-tuned model to the Hugging Face Hub, making it accessible online.
output_dir directory where the fine-tuned model and related artifacts (e.g., logs, checkpoints) are saved.
🐍from diffusers import StableDiffusionPipeline
import torch

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

repo_id_embeds = "/diffusers/examples/textual_inversion/textual_inversion_cat"
pipe.load_textual_inversion(repo_id_embeds)

prompt = "A <cat-toy> backpack"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("cat-backpack.png")
And the generated image using TI:
DreamBooth is a training technique that updates the entire diffusion model by training on just a few images of a subject or style. It works by associating a special word in the prompt with the example images.
There are many more parameters in DreamBooth than in TI; they can be useful for various training purposes and are described in Train dreambooth.
DreamBooth is very heavy, so I had to update my accelerate config, set "use_cpu": true, and set export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
There is an explanation on Hugging Face of how to launch the script on different GPUs. For my very first run, I used the same cat training set and ran:
🐍accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks cat" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=400
It is important to note that, compared to TI, DreamBooth has no token-dependent parameters such as learnable_property, placeholder_token, or initializer_token.
🐍from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "/diffusers/examples/dreambooth/test_dreambooth",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")
image = pipeline("A photo of sks cat in a bucket", num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("cat-bucket.png")
And the generated image using DreamBooth:
LoRA (Low-Rank Adaptation of Large Language Models) is a popular and lightweight training technique that significantly reduces the number of trainable parameters. It works by inserting a smaller number of new weights into the model, and only these are trained. This makes training with LoRA much faster and more memory-efficient, and produces smaller model weights (a few hundred MBs), which are easier to store and share. LoRA can also be combined with other training techniques like DreamBooth to speed up training.
Pay close attention to how the train and test set directories are built for LoRA training (it took me some time). The folder structure is accurately described in the Create a dataset article. No less important is the CSV content and its correct column names: LoRA requires not only the path to each image but also a prompt for that image. That's why I had to create the folder structure and the CSV with prompts myself, since the cats dataset contained images only. Of course, there is a simple way to create prompts for given images, and it is called CLIP Interrogator.
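As a sketch of the CSV I had to build by hand: Hugging Face's imagefolder dataset format expects a metadata.csv next to the images, with a file_name column plus a caption column. The file names and captions below are hypothetical examples, not part of the original dataset:

```python
import csv

# Hypothetical image files and captions; in practice the captions
# can come from CLIP Interrogator, since the cat dataset has no prompts.
rows = [
    {"file_name": "cat_01.png", "text": "a photo of a grey cat sitting on a sofa"},
    {"file_name": "cat_02.png", "text": "a photo of a cat with blue eyes"},
]

with open("metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "text"])
    writer.writeheader()
    writer.writerows(rows)

print(open("metadata.csv").read())
```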
Although the Hugging Face tutorial suggests using another dataset, I continued with the cat dataset, organized as described above, and ran the training:
🐍accelerate launch train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=cat/cat_test \
  --dataloader_num_workers=8 \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --max_train_steps=2 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --output_dir=${OUTPUT_DIR} \
  --checkpointing_steps=1 \
  --seed=1337 \
  --caption_column=image
🐍from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
pipeline.load_lora_weights(
    "/home/blink/diffusers/examples/text_to_image/try_lora",
    weight_name="pytorch_lora_weights_20250107.safetensors",
)
print(pipeline.config)
image = pipeline("A cat with blue eyes").images[0]
image.save("cat_blue_eyes.png")
And the generated image using LoRA:
If you face errors even in easy and simple image generation, I can recommend my simple pill:
🐍pip install --upgrade diffusers
Prompting
Stable diffusion versions
There are not many inputs we can control in [stable diffusion], and one of them is the prompt.
The way we create it can bring us very good or very bad generation results.
A prompt is none other than a 'command' telling the model what we expect it to generate. It is an art in itself, in which every word counts.
Let's take a step back and look into the nature of a prompt. The prompt is parsed by a tokenizer into single tokens before joining the full diffusion process. Importantly, tokens are units not equal to textual symbols: a token might be a single word, a pair of words, or a sequence of letters of a specific length.
There is a variety of tokenization techniques: whitespace, characters, subwords, word pieces, individual words, sentences, regular expressions.
In Stable Diffusion we use the CLIPTokenizer, which applies Byte-Pair Encoding (BPE), a technique that merges the most frequent pairs of characters or subwords. It is important to remember that the number of tokens is limited to 77; if the input is longer, the extra tokens are 'filtered out'.
If you see an error like “The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens”, it means your prompt is too long.
Prompts and tokens are like a 'crammed bus in the morning': we want to bring in as many details as possible to describe our fantasy and wishes, but the place (as we might not have known before) has its limit and may drop important text just to process and create results faster (which is no less important).
There are some best prompt techniques I am often using for image generations:
simple language without complicated, archaic, or rare words;
no introductions or explanation;
short but as detailed as possible phrases;
word order matters; key words should be at the beginning;
punctuation symbols do not carry the same meaning as in human language; for example, different types of brackets and their number are used for emphasizing tokens;
using reference images as {image: abs_path_to_image};
specifying the desired image style and camera parameters (aspect ratio, camera lens, orientation).
You might be surprised how word order and replacing words with synonyms can generate very different images, all because writing prompts is an art in itself.
Credit: https://stable-diffusion-art.com/prompt-guide/
Stable Diffusion has built an impressive version history in a short time.
There are general-purpose models and special ones that generate images in a more advanced, custom way.
The three main general-purpose models:
v1.1 to v1.5 - initial versions with continuous improvements:
v1.1 - version-pioneer for high-quality images generation
v1.2 - tuned image quality
v1.3 - generated images are more realistic and detailed
v1.4 - sampling techniques (noise reducing) and better fine-tuning methods
v1.5 - improvements in processing a range of prompts and generating images with higher resolution
v2.0 to v2.2 - text embeddings (tokens) upgrade and generating images in higher resolutions:
v2.0 - refined handling of text prompts, accurate generation (higher resolution and enhanced detail)
v2.1 - improvements in stability and performance
v2.2 - capabilities for generating more complex and nuanced images
v3.0 - boasts significantly better performance in image quality, prompt understanding, and stability.
Run the same prompt with each version and see how different the results are.
As data requests become more complex, their creation goes the same way. It is no longer enough to just create images from a prompt; we also want to replace certain object(s) in a given image or restyle it.
For those purposes there are special versions of Stable Diffusion:
Stable Diffusion Inpainting - 'filling in' a certain part of the input image, usually defined by a mask
Stable Diffusion Super Resolution (Upscaling) - 'enriching' low-resolution images with pixels to get higher quality
Stable Diffusion for Animation - ensuring consistency and coherence across multiple frames important for animation creation
Stable Diffusion Conditional Generation - image generation with additional input features or constraints
Stable Diffusion for Style Transfer - ‘restyling’ images by applying artistic styles to input image
Stable Diffusion Depth-to-Image - ‘reconstructing’ image from depth map
Labels are another indication of a model's 'potential':
XL 1.0 - advanced model with capabilities for higher resolution and quality, fine-tuned specifically for inpainting tasks
XL-Plus - model designed to push the boundaries of image generation capabilities further, require more complex inputs or additional parameters
Pro - models with better performance, greater detail and clarity of generated images
HD - stands for creating high-resolution, detailed and quality images
Enhanced/Enhanced Base - models specialized on specific improvements or use cases
Negative prompts help to avoid NSFW content generation
Binary mask - the key to image inpainting
What do the following prompts make you think of?
a frontal photo of a joyful man standing on the beach;
a portrait of a smiling woman at an aquapark;
a photo of an impressed woman watching the sunset over the sea.
I imagine photos of beautiful people with different face expressions against beautiful water view backgrounds.
Instead, I got NSFW content error and black images.
Obviously, diffusion associates people in watery places with not being fully dressed (and it makes sense).
This simply means that prompts need to be as detailed as possible, and negative prompts are important to include, without neglecting them.
I usually include words like these in negative prompts:
Deformed, mutated, extra limbs, disfigured, ugly, bad anatomy, missing limbs, bad, immature, cartoon, anime, painting, mutant, body horror, nudity, (six fingers), (extra fingers), (bad hands), (poorly drawn hands), (fused fingers), (too many fingers), (unnatural hands), (disfigured hands)
In modern tech language, full of new terms, 'binary mask' might sound like something very old and forgotten. But just like the classic trend of black and white with its endless variations, binary masks continue to be a widely useful tool in machine learning.
A binary mask, also known as a bit mask, is a simple representation of image pixels as 0 (space) and 1 (bit). It is the simplest method to represent an individual object in an image; that object could be a single object or a group. It's important to point out that the binary information doesn't provide details about the type or label of the specific object, so a group of apples and oranges appears as the same circles on the mask.
A binary mask can be created in at least two ways:
manually, annotated in any segmentation tool and then converted to a mask - take a look at the https://github.com/alfakat/BlackWhiteBinMask repo I run a lot;
as the detection output of a pretrained model.
Both methods have their advantages and disadvantages, but each has its place in bit mask creation.
Manual annotation might be more time-consuming but more accurate, while model detection is faster but produces some number of outliers.
Binary masks are still widely used in annotation validation (the IoU metric), object detection, and inpainting.
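As a minimal pure-Python sketch of how binary masks feed the IoU metric (masks here are nested lists of 0/1; real pipelines would use numpy arrays):

```python
def iou(mask_a, mask_b):
    # Intersection over Union of two same-sized binary masks.
    inter = sum(a & b for row_a, row_b in zip(mask_a, mask_b)
                for a, b in zip(row_a, row_b))
    union = sum(a | b for row_a, row_b in zip(mask_a, mask_b)
                for a, b in zip(row_a, row_b))
    return inter / union if union else 0.0

pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
gt   = [[0, 0, 1],
        [0, 1, 1],
        [0, 0, 1]]
print(iou(pred, gt))  # 3 overlapping pixels out of 5 total -> 0.6
```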
Inpainting is none other than replacing parts (objects) of an image with another object. The new object might be a second input (an image to replace with) or a prompt from which Stable Diffusion generates a new image.
(I’ve made overview of Stable Diffusion Inpainting models in https://www.notion.so/Inpainting-versions-28e29111220c4558bf3d960a7cce22bf article).
The binary mask helps determine the area, setting it as 0 bins (space), so that this space can be replaced/cut/recolored.
Outpainting is used no less often than inpainting. The main idea is the same: to adjust a particular area of the image. The difference between them is that outpainting takes 1 as 'space' and 0 as the area forbidden to modify.
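Under that convention, switching between an inpainting mask and an outpainting mask is literally a bit inversion - a small sketch:

```python
def invert_mask(mask):
    # Flip 0s and 1s: the editable area for inpainting becomes
    # the protected area for outpainting, and vice versa.
    return [[1 - px for px in row] for row in mask]

inpaint_mask = [[0, 1, 1],
                [0, 1, 1]]
print(invert_mask(inpaint_mask))  # [[1, 0, 0], [1, 0, 0]]
```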
Now we know that when we modify or create new images, it's thanks to the 'classic' binary mask.
What is CLIP?
CLIP models overview
CLIP stands for Contrastive Language-Image Pre-Training: essentially 'telling the story about an image'.
Suppose there is a real or generated image that we need to clone, or we need to create something similar. How can we know what the model 'thinks' about it, i.e. what tokens (words) might describe that image so another diffusion model can recreate it?
CLIP analyzes the image and provides the exact, or a very close, prompt it was created from. Of course, the prompt is not the only key to recreating the same image (the checkpoint and seed are important as well), but it gives direction. This analysis is known as the 'CLIP Interrogator'.
The latest CLIP models may add negative prompt, checkpoint, steps and seed details to its output.
There are CLIP models trained on special categories of data to be highly efficient in those areas and return better prompts. Read the CLIP models overview.
According to the open_clip.list_pretrained() function there are about 40 CLIP models, but all of them can be divided into a few main groups:
ResNet
A residual neural network is a convolutional neural network (CNN) architecture that processes images through stacked layers with residual connections. ResNet models are optimized for speed and efficiency and are trained on diverse datasets like openai and cc12m.
🥯 Model variations differ by size (e.g., RN50, RN101) and complexity (e.g., RN50x4, RN50x16)
ViT
A Vision Transformer is a model used for image classification that applies the transformer architecture to images.
🥯 Some widely used models: ViT-B-32, ViT-B-32-quickgelu, ViT-B-16, ViT-L-14, ViT-H-14, ViT-bigG-14
These models are mostly trained on the laion400m and commonpool datasets. Both contain a wide variety of images with accurate annotations, so ViT provides more robust responses.
ConvNeXT
Represents CNN architectures and is one of the most efficient for modern image recognition tasks; trained on the high-quality laion2b dataset.
🥯 Some widely used models: convnext_base, convnext_base_w, convnext_large_d, convnext_xxlarge.
Hybrid models
roberta-ViT - combines NLP architectures (e.g., RoBERTa) with ViT backbones, specifically trained for multilingual or specialized tasks, like cross-modal (text-image) applications:
🥯xlm-roberta-base-ViT-B-32, roberta-ViT-B-32
CoCa-ViT - combines the ViT architecture with a specialized training approach that includes both contrastive learning and captioning objectives:
🥯coca_ViT-B-32, coca_ViT-L-14
Worth mentioning are some of the most popular CLIP models, trained by OpenAI and based on the ViT architecture: clip-vit-base-patch16, clip-vit-large-patch14-336, clip-vit-large-patch14, clip-vit-base-patch32.
Stable Diffusion versions and parameters
Stable diffusion trilogy. Part 1: VAE
Stable Diffusion has become more and more useful in different areas and at different levels: from professional (data creation for machine learning) to end user (creating holiday wishes, cards, etc.).
I personally use these 'services' for both professional and personal requests, mostly for visual data creation, and it is very important to me to deeply understand what I am using and to use it correctly and fully.
Stable Diffusion is a subtype of the Latent Diffusion model, built from three main modules.
In this very first part, let's discover what a VAE is.
Encoder, decoder and autoencoder. All three of them are none other than models that compress and decompress input image(s). Of course, that compression and decompression follows rules based on the coder type:
an encoder-decoder always works as one whole pipeline. This model is trained on supervised (labeled, tagged) data, so it expects categorized data;
an autoencoder is a wider model, trained on unsupervised data, which doesn't need special data preparation before input.
Let's continue in detail with the autoencoder, as it's closer to the generative subject. There are different subtypes as well, such as Variational, Convolutional, Denoising, and Sparse. We will look in detail at the first one, as it is the main one in generative models. It has two parts: an encoder, which converts an input image (or noise) of shape 3×512×512 into a compressed image representation (the latent), and a decoder, which 'reconstructs' that latent after the diffusion process. This process works autonomously, i.e. there is no need to prepare a special input format.
Run a pure autoencoder and see what its part in image generation is:
The latent space tries to provide a compressed understanding of the world to a computer through a spatial representation.
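The compression the encoder performs can be quantified. Assuming the Stable Diffusion v1 shapes (a 3×512×512 RGB input and a 4×64×64 latent), the latent holds 48× fewer values than the image:

```python
image_values  = 3 * 512 * 512   # RGB input tensor
latent_values = 4 * 64 * 64     # SD v1 latent tensor
print(image_values, latent_values, image_values // latent_values)
# 786432 16384 48
```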
Stable Diffusion trilogy. Part 2: Tokenizer
Stable Diffusion trilogy. Part 3: UNet architecture
In this second part, let's discover what a tokenizer is.
Tokenization is the process of converting text into an encoded representation - tokens.
There are different types of tokenization: words, sentences, phrases, single characters.
This technique is widely used in natural language processing (NLP) and large language models (LLMs), and for generative creation as well. Stable Diffusion uses a transformer called CLIP's Text Encoder. CLIP stands for Contrastive Language–Image Pre-training, and it links (unites) image patches and text tokens. A very important point when building a prompt is that it is limited to 77 tokens only (for performance reasons). This is very important to keep in mind while writing prompts.
If the input prompt has fewer than 77 tokens, <end> tokens are added to the end of the sequence until it reaches 77 tokens. If the input prompt has more than 77 tokens, the first 77 tokens are retained and the rest are truncated
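This pad/truncate rule is easy to sketch in pure Python. The token IDs below are made up for illustration, and a real CLIPTokenizer also inserts start/end tokens around the prompt; only the 77-token rule itself is shown:

```python
MAX_LEN = 77
END_TOKEN = 49407  # CLIP's <end> (end-of-text) token id

def pad_or_truncate(token_ids, max_len=MAX_LEN, end_token=END_TOKEN):
    if len(token_ids) >= max_len:
        # Too long: keep the first 77 tokens, drop the rest.
        return token_ids[:max_len]
    # Too short: pad with <end> tokens up to 77.
    return token_ids + [end_token] * (max_len - len(token_ids))

print(len(pad_or_truncate([1, 2, 3])))         # 77
print(len(pad_or_truncate(list(range(100)))))  # 77
```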
Run a pure tokenizer and see what its part in image generation is:
In this last part, let’s discover what is UNet architecture.
The UNet architecture is none other than a neural network (model) widely used for image prediction tasks (and not only those). It first understands the given image and then uncovers the specific details to make accurate predictions.
Actually, the U in the architecture's name 'reflects' its idea:
Understanding
How can UNet understand the given image? By convolutional techniques. It divides the image into pixels and analyzes it little part by little part:
comparing colors and contrast between pixels to check whether they're consistent. Imagine you've taken a real (physical) image and are looking at it with a magnifying glass; the same happens in the first part of the U network. Every loop like this shrinks the map down and 'converts' it into an array, usually stopping when a certain resolution is reached.
Uncovering
Simply rebuilding/repainting the image, but with high confidence in the details (all objects in the image are detected, and the network can actually 'navigate' to them), via the 'skip connection' technique. Just to clarify, the term 'skip connection' means not removing but 'skipping over', in the sense of linking an earlier layer with a current one. As output, we get a binary mask, which we may need for further processing. If we ask to rebuild the image, some quality loss is expected due to the 'transformation' processes.
The input to Stable Diffusion is noise, so UNet expects to receive random pixels and rebuild them using the user's instructions (prompt, image, mask, or ControlNet).
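The U shape can be followed just by tracking spatial resolutions (the depth and input size below are illustrative): each 'understanding' level halves the resolution, each 'uncovering' level doubles it back, and skip connections link the levels whose resolutions match:

```python
def unet_resolutions(input_size=64, depth=3):
    # "Understanding" path: resolution halves at every level.
    down = [input_size // (2 ** i) for i in range(depth + 1)]
    # "Uncovering" path: resolution doubles back up.
    up = list(reversed(down))
    # Skip connections pair down/up levels of equal resolution.
    skips = list(zip(down, reversed(up)))
    return down, up, skips

down, up, skips = unet_resolutions(64, 3)
print(down)   # [64, 32, 16, 8]
print(up)     # [8, 16, 32, 64]
print(skips)  # [(64, 64), (32, 32), (16, 16), (8, 8)]
```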