Data Day 2025 at Open University
Multiplying real dataset with GenAI
Research goal
Create a training set in the most accurate, fastest, and most cost-effective way.
Methods
Let's assume we want to train an AI model that can recognize different types of fruit. We collected images of apples, but the final training set should also contain oranges, pomegranates, etc.
To expand the training set, we can run an external depth model to create depth maps of the collected images and instruct the generative model, via prompt, to generate another fruit instead of an apple in areas with depth > x.
Another way, less time-consuming but very precise, is to create a binary mask of the collected image from an annotated polygon and then generate the new fruit only in those parts of the image.
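The binary-mask approach can be sketched in a few lines of Python. This is a minimal illustration assuming Pillow and NumPy; the polygon coordinates would come from your annotation tool, and the resulting mask would be passed to whatever inpainting pipeline you use.

```python
# Sketch: rasterize an annotated polygon into a binary inpainting mask.
# Assumption: polygon points come from an annotation tool (e.g. LabelMe);
# white (255) marks the area where the new fruit should be generated.
import numpy as np
from PIL import Image, ImageDraw

def polygon_to_mask(width, height, polygon):
    """Turn a list of (x, y) points into a 0/255 binary mask."""
    mask = Image.new("L", (width, height), 0)        # black background
    ImageDraw.Draw(mask).polygon(polygon, fill=255)  # white inside the polygon
    return np.array(mask)

mask = polygon_to_mask(8, 8, [(1, 1), (6, 1), (6, 6), (1, 6)])
```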
Challenges
Interaction with any generative model requires some research about that model family and about the specific model you are going to work with.
For instance, if we build an image-to-image pipeline to generate an orange image, we should take into account that both the original image description provided by CLIP and the user prompt have weights.
Depending on whether CLIP or the prompt carries the higher weight, generated images might be very different.
Once we choose the model we are going to use, it is important to understand whether that model is familiar with fruits, or whether it is recommended for, say, cat/dog image generation.
It is no less important to investigate which tags the model was trained with and to use them in the prompt as a key that unlocks the full potential of what we want to create with that model.
Findings
This way of multiplying a real dataset with generated images allows creating a large and highly varied training set.
Diffusion process
Data TLV 2025: conference notes
My first time at the Data TLV Conference happened to be on its 10th anniversary. The conference agenda includes sessions on data science, data engineering, and data administration.
I would like to leave here my notes per speech I attended:
only a fine-tuned model using biased, customized data from that project made a real impact on the results.
Data quality is the foundation of the entire process; every row should be accurate and meaningful before moving forward;
Calculating a data quality score is one of the ways to define data quality
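A data quality score, as the note suggests, can be as simple as the share of rows that pass basic checks. The table and the checks below (positive price, non-empty name) are my own invented illustration, not from the talk.

```python
import numpy as np
import pandas as pd

# Toy metadata table; the validity checks are illustrative assumptions
df = pd.DataFrame({"price": [10.0, np.nan, -5.0, 7.5],
                   "name": ["apple", "orange", None, "plum"]})

# A row counts as valid when every field is accurate and meaningful
valid = df["price"].gt(0) & df["name"].notna()
quality_score = valid.mean()  # share of valid rows, here 0.5
```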
From 0 to 100: Live GenAI Solution in 30 Minutes by Assi Dahan and Eran Zehavi
Process of MAG - Meta(data) Augmented Generation:
Running once over the data with a set of instructions
Creating metadata per page
Retrieving, per parameter, the pages/paragraphs that contain relevant data
Sending to an LLM with the relevant prompt
implement no-code solutions;
vector similarity: similar or identical vectors usually mean objects of the same group, or the same object
the impact of AI is predicted to reduce insurance issues by 80% by the end of 2045
the combination of Data, Security, and AI makes changes and has impact
Data Harmonization is:
Single Source of Truth
Accelerated AI Readiness
Seamless Integration
Enhanced Compliance & Security
MCP - Model Context Protocol, which joins an LLM with every platform and source
Two main rules:
1 number = 1 graph (1 row and 1 column represent one value)
data = pixel (a piece of information)
The easiest visualization is the clearest
Minimize the number of subgraphs and complicated graphs on one plot
No need to print the value on each bar; it overloads the chart, and the axes already explain the values
The header should be clear and meaningful
Every piece of information on a graph should be explained
Show numbers of at most 4 digits in a graph - the best for the human brain
Lines in an area chart do not have to start from 0, but bars in a bar chart must start from 0.
If there is no choice and several lines must share one graph, colorize only the most important line and leave the rest grey, to highlight the most interesting one.
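The last rule can be sketched with matplotlib; the series names and data below are invented for illustration, and the color choices are my own assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for scripts
import matplotlib.pyplot as plt

def plot_highlighted(series_dict, highlight):
    """Color only the most important line; leave the rest grey."""
    fig, ax = plt.subplots()
    for name, values in series_dict.items():
        if name == highlight:
            ax.plot(values, color="tab:blue", linewidth=2, label=name)
        else:
            ax.plot(values, color="lightgrey", linewidth=1)
    ax.legend()
    return fig, ax

fig, ax = plot_highlighted({"A": [1, 2, 3], "B": [2, 2, 2], "C": [3, 1, 2]}, "B")
```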
The key to high-quality annotated data is a thorough understanding of the project's requirements, goals, and KPIs
Types of annotation teams:
In-house
Outsource
Crowdsource
Combined
one size doesn't fit all;
a wrong DB architecture limits data volume and services;
do not choose a database just because you were familiar with it in the past; choose the best DB for the project's needs;
CAP theorem - AP and CP modes;
not every data type fits a relational DB;
there are also NoSQL databases, which mainly support the key:value data format (wide-column stores);
it is bad practice to support several DBs of different types (e.g., relational and non-relational) at the same time.
Leveraging LLMs for Efficient and Accurate Data Management
Leveraging LLMs for Efficient and Accurate Data Management: model customization
How do you clean and manage a database without the help of an annotator or colleagues? What if you could try using an LLM/NLP model to assist you?
I've created an Inventors database that includes a column with all possible tags in a single string. Using this example, I will demonstrate how to effectively organize the data.
The Inventors dataset can be created by wrapping the rows into a pandas DataFrame for further processing.
import re

import pandas as pd

inventors = [
"Ada-Lovelace-Computer-Programming-1843-No-UK",
"Thomas-Edison-USA-1879-No-Light-Bulb",
"Telephone-Alexander-Bell-1876-Scotland-Telephone-No",
"Marie-Curie-Yes-Poland-Radioactivity-1898-Yes",
"Guglielmo-Marconi-Radio-Italy-Yes-1895",
"James-Watt-1769-No-Scotland-Steam-Engine",
"Italy-Various-1490-No-Leonardo-da-Vinci",
"Germany-Johannes-Gutenberg-Printing-Press-1440-No",
"Tim-Berners-Lee-UK-World-Wide-Web-1989-No",
"Computer-Science-Alan-Turing-UK-1936-No",
"1976-No-Steve-Jobs-USA-Personal-Computer",
"Lamarr-Hedy-No-Austria-Wireless-Communication-1942",
"Sweden-Dynamite-Alfred-Nobel-1867-No",
"No-Wright-Brothers-USA-Airplane-1903",
]
df = pd.DataFrame(inventors, columns=["Inventor"])
The last two columns, 'Year of Invention' and 'Nobel Prize', can be extracted without using a language processing model.
The code snippet below demonstrates that:
The year value can be identified using the re lib.
The Nobel Prize values are boolean-like ("Yes" or "No"), so they are easy to check.
def preprocess_string(string: str):
    """I found out that splitting into small tokens is more efficient than one long input"""
    parts = string.replace('-', ' ').replace('_', ' ').split()
    return parts

def detect_objects_from_parts(parts: list):
    objects = {'NobelPrize': None}
    for part in parts:
        if part == 'Yes':
            objects['NobelPrize'] = 'Yes'
        elif part == 'No':
            objects['NobelPrize'] = 'No'
    return objects
def process_nobel_prize(nobel_prize: str):
    parts = preprocess_string(nobel_prize)
    return pd.Series(detect_objects_from_parts(parts))

df[['NobelPrize']] = df["Inventor"].apply(lambda x: process_nobel_prize(str(x)))
def extract_year(string: str):
    """Extracts a 4-digit year from the string"""
    parts = preprocess_string(string)
    for part in parts:
        if re.match(r"^\d{4}$", part):
            return int(part)
    return None
df["Year"] = df["Inventor"].apply(extract_year)
TIP: Improving Detection Accuracy
One useful technique I found is that removing detected parts from the original string makes subsequent extractions more precise.
Example:
Given the original string: "Ada-Lovelace-Computer-Programming-1843-No-UK"
From previous steps, it's known that:
"No" represents the Nobel Prize status.
"1843" is the year of invention.
Removing these detected parts to create the temporary string "Ada-Lovelace-Computer-Programming-UK" reduces the input size, minimizing the chances of misdetection and improving extraction accuracy.
The cleaning logic is demonstrated in the piece of code below:
def clean_inventor_column(df, column_name):
    df["Inventor"] = df.apply(
        lambda row: "-".join(
            [part for part in preprocess_string(row["Inventor"]) if part != str(row[column_name])]
        ),
        axis=1,
    )
    return df
The most effective way to identify names in a given string is by using Named Entity Recognition (NER) models.
I have experience with 'en_core_web_trf', which I load with spaCy.
If you are going to use this library, I recommend installing version 3.7.2 and downloading the model with:
python -m spacy download en_core_web_trf
The code snippet below demonstrates how to apply an NER model to the Inventors database to find out the Name and Surname of the scientist:
import spacy

trf = spacy.load('en_core_web_trf')
def extract_human_names(string: str):
    parts = preprocess_string(string)
    human_names = []
    for part in parts:
        doc = trf(part)
        for ent in doc.ents:
            if ent.label_ == 'PERSON':
                human_names.append(ent.text)
    return ', '.join(human_names)
df["inventor_names"] = df["Inventor"].apply(lambda x: extract_human_names(str(x)))
df[["Name", "Surname"]] = df["inventor_names"].apply(lambda x: pd.Series(x.split(", ")[:2] if x else ["", ""]))
df.drop(columns=["inventor_names"], inplace=True)
The last step is to extract the remaining details, 'Country' and 'Field', using a general-purpose language model such as en_core_web_sm.
The big challenge is that the model (like AI as a whole) does not fully understand context. In the Inventors database, 'Field' means 'field of research', not a piece of land.
There are two possible solutions:
Rename "Field" to improve model recognition
Using the workaround from before, clean the string after extracting 'Country' and use the remaining part as the 'Field' column value.
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern_country = [{"POS": "PROPN"}]
matcher.add("COUNTRY", [pattern_country])
def extract_country(inventor_string):
    """Extracts the country name from the inventor string"""
    parts = inventor_string.split("-")
    doc = nlp(" ".join(parts))
    matches = matcher(doc)
    for match_id, start, end in matches:
        return doc[start:end].text
    return ""
def extract_field(inventor_string):
    """Extracts the field of invention from the cleaned Inventor column."""
    parts = preprocess_string(inventor_string)
    return " ".join(parts[1:])
df["Country"] = df["Inventor"].apply(extract_country)
clean_inventor_column(df, "Country")
df["Field"] = df["Inventor"].apply(extract_field)
print(df)
The output of all the previous steps looks about 50/50. Obviously, some validation should be done.
Given this output, two main things should be validated:
analyzing empty cells;
validating detection results to confirm their truthfulness.
How to analyze empty cells?
My usual approach to analyzing empty cells is to compare the length of the original input (the one the detection ran on) with the sum of the lengths of the detected parts.
In the example below, only the first row meets the validation requirement. The rest do not, for various reasons (field not detected, surname is not common, etc.).
It can be a good way to spot outliers, but it is not enough on its own.
Subset of detection output with adjusted first line
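The length comparison described above can be sketched as follows; the helper function and its arguments are illustrative assumptions rather than production code.

```python
def passes_length_check(original, detected_parts):
    """True if the detected parts together account for the whole original string."""
    original_len = len(original.replace("-", ""))
    detected_len = sum(len(str(p).replace(" ", "")) for p in detected_parts)
    return original_len == detected_len

# All parts detected -> lengths match
ok = passes_length_check("Marie-Curie-Yes-Poland-Radioactivity-1898-Yes",
                         ["Marie", "Curie", "Poland", "Radioactivity", 1898, "Yes", "Yes"])
# 'Radioactivity' missed -> lengths differ, row flagged for review
bad = passes_length_check("Marie-Curie-Yes-Poland-Radioactivity-1898-Yes",
                          ["Marie", "Curie", "Poland", 1898, "Yes", "Yes"])
```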
How to validate detection results?
There are two outputs from the previous validation: passed and failed. Both of them should be included in the second validation.
However, while the first row successfully passed the length validation, its content is entirely incorrect (or perhaps a new country, 'Computer Programming,' has been added to the map?). By applying a country-database library to this column, outliers (non-existent countries) can be detected.
Though it might seem simpler to deal with the failed results, the only solution is to find a better model, improve detection, or fine-tune the current one, which requires additional effort.
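The country check can be sketched like this. A real implementation could use a country-database package such as pycountry; here a small hard-coded reference set stands in for it, so the set below is an assumption for illustration only.

```python
# Stand-in for a country-database library; a real check would query pycountry
KNOWN_COUNTRIES = {"UK", "USA", "Poland", "Italy", "Scotland",
                   "Germany", "Austria", "Sweden"}

def country_outliers(values):
    """Return detected 'countries' that do not exist on the map."""
    return [v for v in values if v not in KNOWN_COUNTRIES]

outliers = country_outliers(["Poland", "Computer Programming", "UK"])
```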
Data management, particularly cleansing and refactoring databases, is often a dull and lengthy process, but it is crucial. With the adoption of LLMs and strong validation methods, this process can be automated, becoming more enjoyable, with data that is well-organized and accurate.
After applying LLM methods for annotation and classification out of the box in my previous article, Leveraging LLMs for Efficient and Accurate Data Management, model customization naturally becomes the next step to maximize detection accuracy.
Each and every model needs to be fed with data in the correct format, amount, and structure.
A series of person names, as in my DataFrame, is not enough to fine-tune a spaCy model. The names should be explicitly labeled. Moreover, they should be put into context, into sentences, and then labeled, so spaCy can understand the word in context. In my case, I used only the label 'PERSON', but other labels can be used as well.
df = pd.DataFrame(inventors, columns=["Inventor"])
In order to annotate the dataset at scale, I leveraged three approaches: spaCy, GPT-2, and Zephyr GenAI models.
Creating a training set in spaCy starts with compiling a list of prompts, which should be predefined and then randomly enriched with names to generate a variety of labeled examples.
import random

spacy_prompts = ["Assigned {name} to the ticket.",
                 "Spoke with {name} about the case.",
                 "{name} joined the weekly sync."]

spacy_training_data = []
for name in names:
    sentence = random.choice(spacy_prompts).replace("{name}", name)
    start = sentence.index(name)
    spacy_training_data.append((sentence, {"entities": [(start, start + len(name), "PERSON")]}))
from transformers import pipeline

def generate_sentence_with_annotation(name, generator, max_length=50):
    """Generate a sentence containing the name and annotate it for spaCy."""
    output = generator(name, max_length=max_length, num_return_sequences=1,
                       do_sample=True, temperature=0.9)[0]['generated_text']
    sentence = output.split('.')[0].strip()
    start = sentence.find(name)
    if start == -1:
        return None
    end = start + len(name)
    return (sentence, {"entities": [(start, end, "PERSON")]})

gpt_generator = pipeline("text-generation", model="gpt2")
gpt_training_data = []
for name in names:
    result = generate_sentence_with_annotation(name, gpt_generator)
    if result:
        gpt_training_data.append(result)

zephyr_generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")
zephyr_training_data = []
for name in names:
    result = generate_sentence_with_annotation(name, zephyr_generator)
    if result:
        zephyr_training_data.append(result)
from spacy.tokens import DocBin

def save_to_spacy_format(gen_prompts, output_path):
    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for text, ann in gen_prompts:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in ann["entities"]:
            span = doc.char_span(start, end, label=label)
            if span:
                ents.append(span)
        doc.ents = ents
        doc_bin.add(doc)
    doc_bin.to_disk(output_path)
Models training
The config.cfg file in a spaCy project defines everything needed to train the NER model, including:
[paths] Input .spacy files (train/dev)
[nlp] Pipeline config (ner, tok2vec, etc.)
[components.ner] NER component with model architecture
[training] Epochs, optimizer, dropout, patience
[corpora.train] Tells spaCy to read .spacy training files
[system] Random seed and hardware config
To start training the models:
import subprocess

datasets = {
    "spacy": "spacy_train_set.spacy",
    "gpt2": "gpt_train_set.spacy",
    "zephyr": "zephyr_train_set.spacy",
}

for name, dataset in datasets.items():
    output = f"./output_{name}"
    subprocess.run([
        "python", "-m", "spacy", "train", "config.cfg",
        "--output", output,
        "--paths.train", dataset,
        "--paths.dev", dataset,
    ], check=True)
The last important step is to determine which of the new models is most suitable for the goal, by comparing them to each other and to the original model.
The evaluation set can be prepared in a similar way to the training set. For the spaCy workflow, the prompts should be new, while the GenAI models handle prompt variation on their side.
def evaluate_ner_model(nlp, sentences, target_label="PERSON"):
    correct = 0
    for name, sentence in sentences:
        doc = nlp(sentence)
        if any(ent.text == name and ent.label_ == target_label for ent in doc.ents):
            correct += 1
    return correct / len(sentences)
The heatmap shows that the Zephyr model delivered the best results, with GPT-2 ranking second.
Fine-tuning LLMs with different dataset preparation strategies shows clear trade-offs between accuracy, speed, and control. Zephyr offers the highest accuracy, GPT-2 is nearly as good, and the advantage of both is efficient training dataset creation.
By boosting LLMs through efficient data management and smart customization, we can achieve exceptional accuracy and propel the whole process to new heights.
Data cleansing
The best data annotation and management tool
Do you remember the saying, garbage in, garbage out?
This is the essence of the data cleaning process, which stands for identifying errors in data and repairing them. Here are some tips from my workday routine for cleaning data in the most effective way.
*My specialty is Computer Vision data management, so my examples mostly come from that area
Make a backup copy of the data to be cleaned. Remember Murphy's Law and be a little paranoid, to prevent very painful data losses.
From my experience:
not closing a JSON file properly might cause its erasure;
file renaming did not work as expected.
It is very important to understand the mission of the data being prepared and, depending on it, to select only the relevant data fields.
It often happens that we collect more metadata than required. On the one hand, more metadata brings more openness. On the other hand, we end up pulling metadata through the whole pipeline that goes unused (but still needs to be stored, validated, etc.).
From my experience:
many times I create "shrunk" metadata JSONs from the originals;
for efficient image processing, it is especially crucial to consider both the size of the images and the size of the container (folder, Docker image, etc.) where the images are stored. For instance, in a project where both RGB and depth images were collected, if the existing pipeline was unable to process one type of image, I preferred to proceed with the type required at that moment and cut processing in half.
It’s not a new idea that the cost of fixing a bug increases exponentially the later in development that bug is discovered.
The same applies to data cleansing! Issues detected in raw records and errors caught in the early stages of data preparation can prevent double work and ensure high quality.
At those stages validation should be simple, but open-minded (and not always automated). It can be enough to open a few random metadata files, pay attention to file sizes, and review the metadata.
If you have requirements or know which metadata fields are most required/needed, you can ensure that the data contains those fields in the required format.
I usually work with images/videos, so here is my checklist:
[ ] Images are not corrupted, videos are playable
[ ] Configuration and metadata files exist
[ ] Configuration and metadata files have the correct format and data types
[ ] Declared resolution matches the images'/videos' real resolution
[ ] Size of images/videos matches their resolution
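The first checklist item can be partially automated. Below is a minimal sketch using Pillow; the folder layout and the .png extension are assumptions, not part of my actual pipeline.

```python
from pathlib import Path
from PIL import Image

def find_corrupted_images(folder):
    """Return names of files that Pillow cannot verify as valid images."""
    corrupted = []
    for path in sorted(Path(folder).glob("*.png")):
        try:
            with Image.open(path) as img:
                img.verify()  # raises if the file is truncated or not an image
        except Exception:
            corrupted.append(path.name)
    return corrupted
```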
This early exploration of data does not replace the validation pipeline but provides assurance that crucial fields will not become problematic in later stages.
Empties: metadata fields which contain values such as None, N/A, NaN.
There are two possible reasons for empties:
a noisy bug that erased actual field values;
metadata fields that are rarely used or not used at all (fields that are no longer in use, designed for future use, or required only for pipeline support).
Empties of the second type require memory and validation coverage but are never going to be used. They are redundant and might be excluded.
From my experience:
empties could be fields that will be re-annotated in later stages of data processing;
make sure no new empties are created after data processing (copying, generating, etc.)
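Counting empties of both kinds becomes a one-liner with pandas once literal placeholder strings are normalized to real NaN; the table and the placeholder list below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata with placeholder strings alongside a real NaN
df = pd.DataFrame({"camera_id": ["cam1", "N/A", "cam3"],
                   "legacy_field": ["None", "None", np.nan]})

# Normalize placeholders to real NaN, then count empties per column
empties = df.replace(["None", "N/A", "NaN"], np.nan).isna().sum()
```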
During data cleaning, we often encounter issues, errors, or even bugs. Those records can cause significant delays in further data processing.
To avoid this, perform a quick analysis of the number of such records and the nature of the errors. Depending on the number, either continue investigating the error source or set the corrupted records aside temporarily and proceed with the clean data.
For instance, consider a batch containing 100 records. During the initial validation, 2 records have missing fields. In a later validation, 5 more records produce errors. The remaining 93 records successfully pass all validations. Once that 93% of the data is processed, we can concentrate on the remaining 7%. In the worst-case scenario, if these records cannot be fixed, they can be excluded from the batch, and the main batch is still ready by the required deadline.
Of course, the approach should be different if 50% or 70% of the records are corrupted.
From my experience:
Once I discover that some records are corrupted, I notify the relevant colleagues about the possibility of splitting the batch into two. This allows them to be prepared and plan accordingly (e.g., splitting for validation/training sets).
Before starting large batch processing, run it on a few records to ensure there are no unexpected issues. If any issues are discovered, plan your time and deadlines accordingly.
While working with large datasets, such as images or videos, it can be difficult to summarize all parts, especially in JSON or CSV rows. Visualizing metadata can help in understanding and covering the data more comprehensively.
Clean data is a sign of a quality product!
‘What is the best data annotation and management platform for you?’
Ever since I started doing data preparation for ML/AI, I haven't had the opportunity to use a platform that offers all the tools in one place.
My data management workflow typically begins by importing the collected images into my favorite annotation tools, such as CVAT, LabelMe, or SuperAnnotate. Once the labeling is complete, I convert the annotation results from JSON to CSV for annotation validation and push them to a database. CSV files integrate well with Pandas and Pydantic, making metadata cleaning and validation more efficient. Finally, I summarize and present the data's readiness on a dashboard such as Redash and add some annotation examples drawn on images, usually with OpenCV.
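The JSON-to-CSV step can be sketched like this, assuming a LabelMe-style structure ("shapes" entries with "label" and "points"); the exact field names vary between annotation tools, so treat them as assumptions.

```python
import json
import pandas as pd

def annotations_to_csv(json_path, csv_path):
    """Flatten a LabelMe-style annotation JSON into one CSV row per shape."""
    with open(json_path) as f:
        ann = json.load(f)
    rows = [{"image": ann.get("imagePath", ""),
             "label": shape["label"],
             "points": shape["points"]}
            for shape in ann.get("shapes", [])]
    pd.DataFrame(rows).to_csv(csv_path, index=False)
```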
I wish the best data annotation and management platform would include:
direct access to images and metadata, with advanced filtering and grouping;
an annotation tool that supports not only 'classic' bounding boxes and polygons, but also more complex annotation types, or has an API for customization;
post-visualisation and validation of annotations;
extracting annotations in both JSON and CSV formats;
data cleaning and validation pipelines;
connecting to database and pushing results into the tables.
In my view, new data annotation and management platforms should not only incorporate the latest tech gains but also anticipate future needs. I believe that 'classic' annotation types will become less common (detecting a face with a bounding box is now a routine task compared to a few years ago). On the other hand, more specialized annotations such as ellipses are still not widely supported (the ellipse is a widely used annotation type for the human eye, detection of which remains a challenge).
Additionally, it is crucial for annotation platforms to integrate the latest AI solutions, advanced object detection, and pre-trained models, to make the annotation process more efficient and time-saving.
hayaData 2024: conference notes
PyData 2024: conference notes
The hayaData conference took place in Tel Aviv in the last week of September 2024. I was really lucky to attend the event and discover a lot of new stuff.
I would like to leave here my notes per speech I attended:
The Future of Data is Words by Josef Goldstein
metadata is the new gold, and companies that accurately collect, save, and use/reuse it are on top;
prompting in SQL queries (AI query) is a new format of searching in tables
New terms: NSQ, semantic layer, RAG, self-served AI, sql AI query
Evaluating the Unseen: Supervised Evaluation for Unsupervised Algorithms by Ben Harel
understand the manual label groups and create suitable features for unsupervised learning
Tailor-Made LLM Evaluations: How to Create Custom Evaluations for your LLM by Linoy Cohen
the benchmarking Leaderboard at Hugging Face simplifies the evaluation process by answering questions such as whether training passed as expected, etc.
automated and versatile evaluation by using LLM-as-a-judge
LLMs have biases such as position, verbosity, self-enhancement, authority
New terms: data contamination, benchmark frameworks
Metric Store by Ben Hababo and Mickey Rozen
a data lake is a highly common pool;
Monte Carlo algorithms and metrics are widely used;
for such questions, use an AI solution such as Metaphor
New terms: Looks tool, Data Pulse, data democratization
Learning the Ropes of Synthetic Data by Noa Zamstein
an image is a collection of pixels in rows and columns (5 values): X, Y, R, G, B;
synthetic data can be created to represent real data cases, especially in privacy-sensitive areas. Synthetic data references the original data, as if created from it, but without any (privacy) connections;
combinations of data: only one male in that cab, or one unique name, so there is no need to specify; it can easily be defined by a few parameters;
data minimization while analyzing: take only the fields relevant to the case;
data selection as a hierarchy (e.g., not 'heart attack' but 'heart syndrome', more generic);
correlation metrics between real and synthetic data;
synthetic data is a chance to create non-existent things (like trees on empty roads, etc.)
New terms: catalytic practices, article
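The "correlation metrics" note above can be made concrete by comparing the column-correlation matrices of a real and a synthetic table; the columns and numbers below are invented for illustration, not from the talk.

```python
import numpy as np

# Invented example: two columns (e.g. height, weight) in real vs synthetic data
real = np.array([[170, 65], [180, 80], [160, 55], [175, 72]])
synthetic = np.array([[172, 66], [178, 79], [162, 57], [174, 70]])

real_corr = np.corrcoef(real, rowvar=False)
synth_corr = np.corrcoef(synthetic, rowvar=False)
gap = np.abs(real_corr - synth_corr).max()  # small gap = correlations preserved
```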
LLM and Knowledge Graphs: A case study in Blue(y) by Stav Shamir
a trained model returns vectors; our requests are compared to the trained vectors, and the closest vectors are the outputs for the request;
vector embedding turns words into numbers;
graphs with connections become complicated neural networks;
schemas should not be too complex (or you may not receive any answer) nor too easy (or you would not need an LLM for that);
representing LLM connections as graphs helps to see them as a full picture
Live in Data Wild West: the data contracts sheriff by Tal Peretz
data issues bring real leaks and legal situations;
data quality is the most important thing to care about;
New terms: data contracts, DataHub (reviewer) , AvroScheme, source of truth, fields deprecations (silent vs …), scheme evolution, schema.yml, DBT expectation (tests on DB), assertion, elementary
Navigating the Uncharted: Ensuring Prompt Quality in the Age of Language Models by Ortal Ashkenazi
Several way to ensure prompt quality:
manually: efficient, but not fast;
LLM-as-a-judge: fast, but depends on the prompt and LLM version;
leveraging an NLM for prompt verification, entity recognition:
cons: there are no models for everything;
automate metrics validation;
unchecked prompts may contain hidden dangers
New terms: NLP, NLM
Exploring the Depths of Apache Iceberg's Metadata Capabilities by Amit Gilad
there are other modern, cloud solutions to store data (data lakes) instead of backup disks
Threat Hunting Powered by Efficient and Straightforward anomaly detection on your data lake by Ori Nakar
anomalies are things that deviate from what is standard, normal, or expected, but they should be defined per specific case;
running queries with JSON;
SQL is still the fastest way to detect anomalies;
an LLM with RAG (SQL statement) runs text queries without syntax dependence.
New terms: data scanning, CVE, DDoS
Dating with a super model: why good prompt engineering for data monitoring requires some flirting by Reut Vilek
no long prompts; break them into a few small ones;
take it step by step;
give references, examples of sites, images, etc.;
parse and analyze results to regenerate them in a better way
New terms: FML, NoCodeTool
Excited to try these discoveries in my daily work!
After a few postponements, PyData finally took place in Tel Aviv on November 4th, 2024.
I would like to leave here my notes per speech I attended:
The Dangerous Data Anonymization by Ran Bar Zik
Anonymization means taking data and removing all private details
There are different anonymization techniques:
masking;
dynamic masking;
aggregation (age as range, general names);
pseudonymisation or reversible anonymization;
differential privacy - add noise, mix data;
more privacy means less accuracy;
"You are holding data, think before releasing it!"
New terms: PII masking, AnonyPyx Python libs; HIPAA, CCPA
Unveiling the Journey of Natural Language Processing (NLP): Milestones, Limitations, and Practical Applications by Ortal Ashkenazi
MLM - masked language model;
transfer learning;
learning types: zero shot, one shot, few shots;
LLMs are limited to static data (if there are no slang words in the training data, the model doesn't understand requests with those words);
RAG overcomes knowledge limitations;
Multimodal image integration
New terms: NLIK
A Shallow Introduction to Self-Attention by Alon Oring
Naive self-attention: how similar embeddings are (as word embeddings), scoring one against the others;
Recurrent Neural Network;
Contextualized embeddings
New terms: RNN, QKV
Securing Language Models Against Prompt Injection with the Powerful LangChain Framework by Michael Ethan Levinger
security through adversarial testing (the LLM learns bad things);
Rebuff - a moderation endpoint for detecting and managing harmful content;
DAN stands for "do anything now" (in the prompt sense);
indirect prompt injection while loading from an external source: docs, CVs;
if you use 'labels' when asking the model, it interacts better (IBM models example)
New terms: Lakera
Ibis framework - Making data science work at any scale by Omri Fima
https://github.com/thegreymatter/ibisframeworks - allows working with different data sources in one place (for instance, joining a CSV with a table in a database);
pip install ibis;
ibis.con.compile - describes a query;
ibis udf
ibis.udf.scalar.pyarrow ->tokens
SQL[CRUD]-to-Text
I didn't realize that "SQL-to-Text" models can only translate SELECT statements!
Until I tried to explain a complicated INSERT query that a user needed to confirm inside an annotation tool... and the models (HuggingFaceH4's zephyr-7b-beta, Google's T5, and even the fine-tuned mrm8488/t5-base-finetuned-wikiSQL) just answered: "What...?" (as in "what to SELECT?").
That's when I discovered that these models don't even understand the basic SQL statements:
Create (Insert) → add new records into a table;
Read (Select) → read existing records;
Update → change existing records;
Delete → remove records.
To bridge this gap, I built a prompt-based solution per SQL statement. Each prompt tells the model what the statement really means, and then guides token generation - so the final output isn’t just a translation of syntax, but a clear explanation of what that specific query will do.
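The per-statement prompting idea can be sketched as below; the templates and the statement-detection logic are illustrative assumptions, not the exact implementation behind the linked demo.

```python
# Illustrative prompt templates, one per CRUD statement
CRUD_PROMPTS = {
    "INSERT": "This query ADDS new records. Explain in plain English what will be added: {query}",
    "SELECT": "This query READS records. Explain in plain English what will be returned: {query}",
    "UPDATE": "This query CHANGES existing records. Explain in plain English what will change: {query}",
    "DELETE": "This query REMOVES records. Explain in plain English what will be removed: {query}",
}

def build_prompt(query):
    """Pick the template that tells the model what the statement really means."""
    statement = query.strip().split()[0].upper()
    template = CRUD_PROMPTS.get(statement)
    return template.format(query=query) if template else None

prompt = build_prompt("INSERT INTO fruits (name) VALUES ('orange')")
```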
This way, annotators and users with different SQL levels can approve (or reject) queries in plain English, and all teams align faster on project details without getting stuck in syntax.
(CRUD)SQL-to-Text turns the most complex queries into very simple insights and lets us spend less time explaining SQL and more time deciding!
Try CRUD-SQL2Text workaround in action: https://lnkd.in/dQXbhftY