Data Day 2025 at Open University
Multiplying real dataset with GenAI
Research goal
Create a training set in the most accurate, fastest, and most cost-effective way.
Methods
Let's assume we want to train an AI model that can recognize different types of fruit. We collected images of apples, but the final training set should also contain oranges, pomegranates, etc.
To expand the training set, we can run an external depth model to create depth maps of the collected images and instruct the generative model, via prompt, to generate another fruit instead of an apple in areas with depth > x.
Another way, less time-consuming but very precise, is to create a binary mask of the collected image from an annotated polygon and then generate the new fruit only in those parts of the image.
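The binary-mask approach can be sketched in a few lines of Python. This is a minimal illustration assuming Pillow and NumPy; the polygon coordinates would come from your annotation tool, and the resulting mask would be passed to whatever inpainting pipeline you use.

```python
# Sketch: rasterize an annotated polygon into a binary inpainting mask.
# Assumption: polygon points come from an annotation tool (e.g. LabelMe);
# white (255) marks the area where the new fruit should be generated.
import numpy as np
from PIL import Image, ImageDraw

def polygon_to_mask(width, height, polygon):
    """Turn a list of (x, y) points into a 0/255 binary mask."""
    mask = Image.new("L", (width, height), 0)        # black background
    ImageDraw.Draw(mask).polygon(polygon, fill=255)  # white inside the polygon
    return np.array(mask)

mask = polygon_to_mask(8, 8, [(1, 1), (6, 1), (6, 6), (1, 6)])
```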
Challenges
Interaction with any generative model requires some research about that model family and about the specific model you are going to work with.
For instance, if we build an image-to-image pipeline to generate an orange image, we should take into account that both the original image description provided by CLIP and the user prompt have weights.
Depending on whether CLIP or the prompt carries the higher weight, generated images might be very different.
Once we choose the model we are going to use, it is important to understand whether that model is familiar with fruits, or whether it is recommended for, say, cat/dog image generation.
It is no less important to investigate which tags the model was trained with and to use them in the prompt as a key that unlocks the full potential of what we want to create with that model.
Findings
This way of multiplying a real dataset with generated images allows creating a large and highly varied training set.
Diffusion process
Data TLV 2025: conference notes
My first time at the Data TLV Conference happened to be on its 10th anniversary. The conference agenda includes sessions on data science, data engineering, and data administration.
I would like to leave here my notes per speech I attended:
only a fine-tuned model using biased, customized data from that project made a real impact on the results.
Data quality is the foundation of the entire process; every row should be accurate and meaningful before moving forward;
Calculating a data quality score is one of the ways to define data quality
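A data quality score, as the note suggests, can be as simple as the share of rows that pass basic checks. The table and the checks below (positive price, non-empty name) are my own invented illustration, not from the talk.

```python
import numpy as np
import pandas as pd

# Toy metadata table; the validity checks are illustrative assumptions
df = pd.DataFrame({"price": [10.0, np.nan, -5.0, 7.5],
                   "name": ["apple", "orange", None, "plum"]})

# A row counts as valid when every field is accurate and meaningful
valid = df["price"].gt(0) & df["name"].notna()
quality_score = valid.mean()  # share of valid rows, here 0.5
```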
From 0 to 100: Live GenAI Solution in 30 Minutes by Assi Dahan and Eran Zehavi
Process of MAG - Meta(data) Augmented Generation:
Running once over the data with a set of instructions
Creating metadata per page
Retrieving, per parameter, the pages/paragraphs that contain relevant data
Sending to an LLM with the relevant prompt
implement no-code solutions;
vector similarity: similar or identical vectors usually mean objects of the same group, or the same object
the impact of AI is predicted to reduce insurance issues by 80% by the end of 2045
the combination of Data, Security, and AI makes changes and has impact
Data Harmonization is:
Single Source of Truth
Accelerated AI Readiness
Seamless Integration
Enhanced Compliance & Security
MCP - Model Context Protocol, which joins an LLM with every platform and source
Two main rules:
1 number = 1 graph (1 row and 1 column represent one value)
data = pixel (a piece of information)
The easiest visualization is the clearest
Minimize the number of subgraphs and complicated graphs on one plot
No need to print the value on each bar; it overloads the chart, and the axes already explain the values
The header should be clear and meaningful
Every piece of information on a graph should be explained
Show numbers of at most 4 digits in a graph - the best for the human brain
Lines in an area chart do not have to start from 0, but bars in a bar chart must start from 0.
If there is no choice and several lines must share one graph, colorize only the most important line and leave the rest grey, to highlight the most interesting one.
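The last rule can be sketched with matplotlib; the series names and data below are invented for illustration, and the color choices are my own assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for scripts
import matplotlib.pyplot as plt

def plot_highlighted(series_dict, highlight):
    """Color only the most important line; leave the rest grey."""
    fig, ax = plt.subplots()
    for name, values in series_dict.items():
        if name == highlight:
            ax.plot(values, color="tab:blue", linewidth=2, label=name)
        else:
            ax.plot(values, color="lightgrey", linewidth=1)
    ax.legend()
    return fig, ax

fig, ax = plot_highlighted({"A": [1, 2, 3], "B": [2, 2, 2], "C": [3, 1, 2]}, "B")
```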
The key to high-quality annotated data is a thorough understanding of the project's requirements, goals, and KPIs
Types of annotation teams:
In-house
Outsource
Crowdsource
Combined
one size doesn't fit all;
a wrong DB architecture limits data volume and services;
do not choose a database just because you were familiar with it in the past; choose the best DB for the project's needs;
CAP theorem - AP and CP modes;
not every data type fits a relational DB;
there are also NoSQL databases, which mainly support the key:value data format (wide-column stores);
it is bad practice to support several DBs of different types (e.g., relational and non-relational) at the same time.
Leveraging LLMs for Efficient and Accurate Data Management
Leveraging LLMs for Efficient and Accurate Data Management: model customization
How do you clean and manage a database without the help of an annotator or colleagues? What if you could try using an LLM/NLP model to assist you?
I've created an Inventors database that includes a column with all possible tags in a single string. Using this example, I will demonstrate how to effectively organize the data.
The Inventors dataset can be created by wrapping the rows into a pandas DataFrame for further processing.
import re

import pandas as pd

inventors = [
"Ada-Lovelace-Computer-Programming-1843-No-UK",
"Thomas-Edison-USA-1879-No-Light-Bulb",
"Telephone-Alexander-Bell-1876-Scotland-Telephone-No",
"Marie-Curie-Yes-Poland-Radioactivity-1898-Yes",
"Guglielmo-Marconi-Radio-Italy-Yes-1895",
"James-Watt-1769-No-Scotland-Steam-Engine",
"Italy-Various-1490-No-Leonardo-da-Vinci",
"Germany-Johannes-Gutenberg-Printing-Press-1440-No",
"Tim-Berners-Lee-UK-World-Wide-Web-1989-No",
"Computer-Science-Alan-Turing-UK-1936-No",
"1976-No-Steve-Jobs-USA-Personal-Computer",
"Lamarr-Hedy-No-Austria-Wireless-Communication-1942",
"Sweden-Dynamite-Alfred-Nobel-1867-No",
"No-Wright-Brothers-USA-Airplane-1903",
]
df = pd.DataFrame(inventors, columns=["Inventor"])
The last two columns, 'Year of Invention' and 'Nobel Prize', can be extracted without using a language processing model.
The code snippet below demonstrates that:
The year value can be identified using the re lib.
The Nobel Prize values are boolean-like ("Yes" or "No"), so they are easy to check.
def preprocess_string(string: str):
    """I found out that splitting into small tokens is more efficient than one long input"""
    parts = string.replace('-', ' ').replace('_', ' ').split()
    return parts

def detect_objects_from_parts(parts: list):
    objects = {'NobelPrize': None}
    for part in parts:
        if part == 'Yes':
            objects['NobelPrize'] = 'Yes'
        elif part == 'No':
            objects['NobelPrize'] = 'No'
    return objects
def process_nobel_prize(nobel_prize: str):
    parts = preprocess_string(nobel_prize)
    return pd.Series(detect_objects_from_parts(parts))

df[['NobelPrize']] = df["Inventor"].apply(lambda x: process_nobel_prize(str(x)))
def extract_year(string: str):
    """Extracts a 4-digit year from the string"""
    parts = preprocess_string(string)
    for part in parts:
        if re.match(r"^\d{4}$", part):
            return int(part)
    return None
df["Year"] = df["Inventor"].apply(extract_year)
TIP: Improving Detection Accuracy
One useful technique I found is that removing detected parts from the original string makes subsequent extractions more precise.
Example:
Given the original string: "Ada-Lovelace-Computer-Programming-1843-No-UK"
From previous steps, it's known that:
"No" represents the Nobel Prize status.
"1843" is the year of invention.
Removing these detected parts to create the temporary string "Ada-Lovelace-Computer-Programming-UK" reduces the input size, minimizing the chances of misdetection and improving extraction accuracy.
The cleaning logic is demonstrated in the piece of code below:
def clean_inventor_column(df, column_name):
    df["Inventor"] = df.apply(
        lambda row: "-".join(
            [part for part in preprocess_string(row["Inventor"]) if part != str(row[column_name])]
        ),
        axis=1,
    )
    return df
The most effective way to identify names in a given string is by using Named Entity Recognition (NER) models.
I have experience with 'en_core_web_trf', which I load with spaCy.
If you are going to use this library, I recommend installing version 3.7.2 and downloading the model with:
python -m spacy download en_core_web_trf
The code snippet below demonstrates how to apply an NER model to the Inventors database to find out the Name and Surname of the scientist:
import spacy

trf = spacy.load('en_core_web_trf')
def extract_human_names(string: str):
    parts = preprocess_string(string)
    human_names = []
    for part in parts:
        doc = trf(part)
        for ent in doc.ents:
            if ent.label_ == 'PERSON':
                human_names.append(ent.text)
    return ', '.join(human_names)
df["inventor_names"] = df["Inventor"].apply(lambda x: extract_human_names(str(x)))
df[["Name", "Surname"]] = df["inventor_names"].apply(lambda x: pd.Series(x.split(", ")[:2] if x else ["", ""]))
df.drop(columns=["inventor_names"], inplace=True)
The last step is to extract the remaining details, 'Country' and 'Field', using a general-purpose language model such as en_core_web_sm.
The big challenge is that the model (like AI as a whole) does not fully understand context. In the Inventors database, 'Field' means 'field of research', not a piece of land.
There are two possible solutions:
Rename "Field" to improve model recognition
Using the workaround from before, clean the string after extracting 'Country' and use the remaining part as the 'Field' column value.
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern_country = [{"POS": "PROPN"}]
matcher.add("COUNTRY", [pattern_country])
def extract_country(inventor_string):
    """Extracts the country name from the inventor string"""
    parts = inventor_string.split("-")
    doc = nlp(" ".join(parts))
    matches = matcher(doc)
    for match_id, start, end in matches:
        return doc[start:end].text
    return ""
def extract_field(inventor_string):
    """Extracts the field of invention from the cleaned Inventor column."""
    parts = preprocess_string(inventor_string)
    return " ".join(parts[1:])
df["Country"] = df["Inventor"].apply(extract_country)
clean_inventor_column(df, "Country")
df["Field"] = df["Inventor"].apply(extract_field)
print(df)
The output of all the previous steps looks about 50/50. Obviously, some validation should be done.
Given this output, two main things should be validated:
analyzing empty cells;
validating detection results to confirm their truthfulness.
How to analyze empty cells?
My usual approach to analyzing empty cells is to compare the length of the original input (the one the detection ran on) with the sum of the lengths of the detected parts.
In the example below, only the first row meets the validation requirement. The rest do not, for various reasons (field not detected, surname is not common, etc.).
It can be a good way to spot outliers, but it is not enough on its own.
Subset of detection output with adjusted first line
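The length comparison described above can be sketched as follows; the helper function and its arguments are illustrative assumptions rather than production code.

```python
def passes_length_check(original, detected_parts):
    """True if the detected parts together account for the whole original string."""
    original_len = len(original.replace("-", ""))
    detected_len = sum(len(str(p).replace(" ", "")) for p in detected_parts)
    return original_len == detected_len

# All parts detected -> lengths match
ok = passes_length_check("Marie-Curie-Yes-Poland-Radioactivity-1898-Yes",
                         ["Marie", "Curie", "Poland", "Radioactivity", 1898, "Yes", "Yes"])
# 'Radioactivity' missed -> lengths differ, row flagged for review
bad = passes_length_check("Marie-Curie-Yes-Poland-Radioactivity-1898-Yes",
                          ["Marie", "Curie", "Poland", 1898, "Yes", "Yes"])
```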
How to validate detection results?
There are two outputs from the previous validation: passed and failed. Both of them should be included in the second validation.
However, while the first row successfully passed the length validation, its content is entirely incorrect (or perhaps a new country, 'Computer Programming,' has been added to the map?). By applying a country-database library to this column, outliers (non-existent countries) can be detected.
Though it might seem simpler to deal with the failed results, the only solution is to find a better model, improve detection, or fine-tune the current one, which requires additional effort.
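The country check can be sketched like this. A real implementation could use a country-database package such as pycountry; here a small hard-coded reference set stands in for it, so the set below is an assumption for illustration only.

```python
# Stand-in for a country-database library; a real check would query pycountry
KNOWN_COUNTRIES = {"UK", "USA", "Poland", "Italy", "Scotland",
                   "Germany", "Austria", "Sweden"}

def country_outliers(values):
    """Return detected 'countries' that do not exist on the map."""
    return [v for v in values if v not in KNOWN_COUNTRIES]

outliers = country_outliers(["Poland", "Computer Programming", "UK"])
```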
Data management, particularly cleansing and refactoring databases, is often a dull and lengthy process, but it is crucial. With the adoption of LLMs and strong validation methods, this process can be automated, becoming more enjoyable, with data that is well-organized and accurate.
After applying LLM methods for annotation and classification out of the box in my previous article, Leveraging LLMs for Efficient and Accurate Data Management, model customization naturally becomes the next step to maximize detection accuracy.
Each and every model needs to be fed with data in the correct format, amount, and structure.
A series of person names, as in my DataFrame, is not enough to fine-tune a spaCy model. The names should be explicitly labeled. Moreover, they should be put into context, into sentences, and then labeled, so spaCy can understand the word in context. In my case, I used only the label 'PERSON', but other labels can be used as well.
df = pd.DataFrame(inventors, columns=["Inventor"])
In order to annotate the dataset at scale, I leveraged three approaches: spaCy, GPT-2, and Zephyr GenAI models.
Creating a training set in spaCy starts with compiling a list of prompts, which should be predefined and then randomly enriched with names to generate a variety of labeled examples.
import random

spacy_prompts = ["Assigned {name} to the ticket.",
                 "Spoke with {name} about the case.",
                 "{name} joined the weekly sync."]

spacy_training_data = []
for name in names:
    sentence = random.choice(spacy_prompts).replace("{name}", name)
    start = sentence.index(name)
    spacy_training_data.append((sentence, {"entities": [(start, start + len(name), "PERSON")]}))
from transformers import pipeline

def generate_sentence_with_annotation(name, generator, max_length=50):
    """Generate a sentence containing the name and annotate it for spaCy."""
    output = generator(name, max_length=max_length, num_return_sequences=1,
                       do_sample=True, temperature=0.9)[0]['generated_text']
    sentence = output.split('.')[0].strip()
    start = sentence.find(name)
    if start == -1:
        return None
    end = start + len(name)
    return (sentence, {"entities": [(start, end, "PERSON")]})

gpt_generator = pipeline("text-generation", model="gpt2")
gpt_training_data = []
for name in names:
    result = generate_sentence_with_annotation(name, gpt_generator)
    if result:
        gpt_training_data.append(result)

zephyr_generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")
zephyr_training_data = []
for name in names:
    result = generate_sentence_with_annotation(name, zephyr_generator)
    if result:
        zephyr_training_data.append(result)
from spacy.tokens import DocBin

def save_to_spacy_format(gen_prompts, output_path):
    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for text, ann in gen_prompts:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in ann["entities"]:
            span = doc.char_span(start, end, label=label)
            if span:
                ents.append(span)
        doc.ents = ents
        doc_bin.add(doc)
    doc_bin.to_disk(output_path)
Models training
The config.cfg file in a spaCy project defines everything needed to train the NER model, including:
[paths] Input .spacy files (train/dev)
[nlp] Pipeline config (ner, tok2vec, etc.)
[components.ner] NER component with model architecture
[training] Epochs, optimizer, dropout, patience
[corpora.train] Tells spaCy to read .spacy training files
[system] Random seed and hardware config
To start training the models:
import subprocess

datasets = {
    "spacy": "spacy_train_set.spacy",
    "gpt2": "gpt_train_set.spacy",
    "zephyr": "zephyr_train_set.spacy",
}

for name, dataset in datasets.items():
    output = f"./output_{name}"
    subprocess.run([
        "python", "-m", "spacy", "train", "config.cfg",
        "--output", output,
        "--paths.train", dataset,
        "--paths.dev", dataset,
    ], check=True)
The last important step is to determine which of the new models is most suitable for the goal, by comparing them to each other and to the original model.
The evaluation set can be prepared in a similar way to the training set. For the spaCy workflow, the prompts should be new, while the GenAI models handle prompt variation on their side.
def evaluate_ner_model(nlp, sentences, target_label="PERSON"):
    correct = 0
    for name, sentence in sentences:
        doc = nlp(sentence)
        if any(ent.text == name and ent.label_ == target_label for ent in doc.ents):
            correct += 1
    return correct / len(sentences)
The heatmap shows that the Zephyr model delivered the best results, with GPT-2 ranking second.
Fine-tuning LLMs with different dataset preparation strategies shows clear trade-offs between accuracy, speed, and control. Zephyr offers the highest accuracy, GPT-2 is nearly as good, and the advantage of both is efficient training dataset creation.
By boosting LLMs through efficient data management and smart customization, we can achieve exceptional accuracy and propel the whole process to new heights.
Data cleansing
The best data annotation and management tool
Do you remember the saying, garbage in, garbage out?
This is the essence of the data cleaning process, which stands for identifying errors in data and repairing them. Here are some tips from my workday routine for cleaning data in the most effective way.
*My specialty is Computer Vision data management, so my examples mostly come from that area
Make a backup copy of the data to be cleaned. Remember Murphy's Law and be a little paranoid, to prevent very painful data losses.
From my experience:
not closing a JSON file properly might cause its erasure;
file renaming did not work as expected.
It is very important to understand the mission of the data being prepared and, depending on it, to select only the relevant data fields.
It often happens that we collect more metadata than required. On the one hand, more metadata brings more openness. On the other hand, we end up pulling metadata through the whole pipeline that goes unused (but still needs to be stored, validated, etc.).
From my experience:
many times I create "shrunk" metadata JSONs from the originals;
for efficient image processing, it is especially crucial to consider both the size of the images and the size of the container (folder, Docker image, etc.) where the images are stored. For instance, in a project where both RGB and depth images were collected, if the existing pipeline was unable to process one type of image, I preferred to proceed with the type required at that moment and cut processing in half.
It’s not a new idea that the cost of fixing a bug increases exponentially the later in development that bug is discovered.
The same applies to data cleansing! Issues detected in raw records and errors caught in the early stages of data preparation can prevent double work and ensure high quality.
At those stages validation should be simple, but open-minded (and not always automated). It can be enough to open a few random metadata files, pay attention to file sizes, and review the metadata.
If you have requirements or know which metadata fields are most required/needed, you can ensure that the data contains those fields in the required format.
I usually work with images/videos, so here is my checklist:
[ ] Images are not corrupted, videos are playable
[ ] Configuration and metadata files exist
[ ] Configuration and metadata files have the correct format and data types
[ ] Declared resolution matches the images'/videos' real resolution
[ ] Size of images/videos matches their resolution
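The first checklist item can be partially automated. Below is a minimal sketch using Pillow; the folder layout and the .png extension are assumptions, not part of my actual pipeline.

```python
from pathlib import Path
from PIL import Image

def find_corrupted_images(folder):
    """Return names of files that Pillow cannot verify as valid images."""
    corrupted = []
    for path in sorted(Path(folder).glob("*.png")):
        try:
            with Image.open(path) as img:
                img.verify()  # raises if the file is truncated or not an image
        except Exception:
            corrupted.append(path.name)
    return corrupted
```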
This early exploration of data does not replace the validation pipeline but provides assurance that crucial fields will not become problematic in later stages.
Empties: metadata fields which contain values such as None, N/A, NaN.
There are two possible reasons for empties:
a noisy bug that erased actual field values;
metadata fields that are rarely used or not used at all (fields that are no longer in use, designed for future use, or required only for pipeline support).
Empties of the second type require memory and validation coverage but are never going to be used. They are redundant and might be excluded.
From my experience:
empties could be fields that will be re-annotated in later stages of data processing;
make sure no new empties are created after data processing (copying, generating, etc.)
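Counting empties of both kinds becomes a one-liner with pandas once literal placeholder strings are normalized to real NaN; the table and the placeholder list below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata with placeholder strings alongside a real NaN
df = pd.DataFrame({"camera_id": ["cam1", "N/A", "cam3"],
                   "legacy_field": ["None", "None", np.nan]})

# Normalize placeholders to real NaN, then count empties per column
empties = df.replace(["None", "N/A", "NaN"], np.nan).isna().sum()
```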
During data cleaning, we often encounter issues, errors, or even bugs. Those records can cause significant delays in further data processing.
To avoid this, perform a quick analysis of the number of such records and the nature of the errors. Depending on the number, either continue investigating the error source or set the corrupted records aside temporarily and proceed with the clean data.
For instance, consider a batch containing 100 records. During the initial validation, 2 records have missing fields. In a later validation, 5 more records produce errors. The remaining 93 records successfully pass all validations. Once that 93% of the data is processed, we can concentrate on the remaining 7%. In the worst-case scenario, if these records cannot be fixed, they can be excluded from the batch, and the main batch is still ready by the required deadline.
Of course, the approach should be different if 50% or 70% of the records are corrupted.
From my experience:
Once I discover that some records are corrupted, I notify the relevant colleagues about the possibility of splitting the batch into two. This allows them to be prepared and plan accordingly (e.g., splitting for validation/training sets).
Before starting large batch processing, run it on a few records to ensure there are no unexpected issues. If any issues are discovered, plan your time and deadlines accordingly.
While working with large datasets, such as images or videos, it can be difficult to summarize all parts, especially in JSON or CSV rows. Visualizing metadata can help in understanding and covering the data more comprehensively.
Clean data is a sign of a quality product!
‘What is the best data annotation and management platform for you?’
Ever since I started doing data preparation for ML/AI, I haven't had the opportunity to use a platform that offers all the tools in one place.
My data management workflow typically begins by importing the collected images into my favorite annotation tools, such as CVAT, LabelMe, or SuperAnnotate. Once the labeling is complete, I convert the annotation results from JSON to CSV for annotation validation and push them to a database. CSV files integrate well with Pandas and Pydantic, making metadata cleaning and validation more efficient. Finally, I summarize and present the data's readiness on a dashboard such as Redash and add some annotation examples drawn on images, usually with OpenCV.
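The JSON-to-CSV step can be sketched like this, assuming a LabelMe-style structure ("shapes" entries with "label" and "points"); the exact field names vary between annotation tools, so treat them as assumptions.

```python
import json
import pandas as pd

def annotations_to_csv(json_path, csv_path):
    """Flatten a LabelMe-style annotation JSON into one CSV row per shape."""
    with open(json_path) as f:
        ann = json.load(f)
    rows = [{"image": ann.get("imagePath", ""),
             "label": shape["label"],
             "points": shape["points"]}
            for shape in ann.get("shapes", [])]
    pd.DataFrame(rows).to_csv(csv_path, index=False)
```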
I wish the best data annotation and management platform would include:
direct access to images and metadata, with advanced filtering and grouping;
an annotation tool that supports not only 'classic' bounding boxes and polygons, but also more complex annotation types, or has an API for customization;
post-visualisation and validation of annotations;
extracting annotations in both JSON and CSV formats;
data cleaning and validation pipelines;
connecting to database and pushing results into the tables.
In my view, new data annotation and management platforms should not only incorporate the latest tech gains but also anticipate future needs. I believe that 'classic' annotation types will become less common (detecting a face with a bounding box is now a routine task compared to a few years ago). On the other hand, more specialized annotations such as ellipses are still not widely supported (the ellipse is a widely used annotation type for the human eye, detection of which remains a challenge).
Additionally, it is crucial for annotation platforms to integrate the latest AI solutions, advanced object detection, and pre-trained models, to make the annotation process more efficient and time-saving.
hayaData 2024: conference notes
PyData 2024: conference notes
The hayaData conference took place in Tel Aviv in the last week of September 2024. I was really lucky to attend the event and discover a lot of new stuff.
I would like to leave here my notes per speech I attended:
The Future of Data is Words by Josef Goldstein
metadata is the new gold, and companies that accurately collect, save, and use/reuse it are on top;
prompting in SQL queries (AI query) is a new format of searching in tables
New terms: NSQ, semantic layer, RAG, self-served AI, sql AI query
Evaluating the Unseen: Supervised Evaluation for Unsupervised Algorithms by Ben Harel
understand the manual label groups and create suitable features for unsupervised learning
Tailor-Made LLM Evaluations: How to Create Custom Evaluations for your LLM by Linoy Cohen
the benchmarking Leaderboard at Hugging Face simplifies the evaluation process by answering questions such as whether training passed as expected, etc.
automated and versatile evaluation by using LLM-as-a-judge
LLMs have biases such as position, verbosity, self-enhancement, authority
New terms: data contamination, benchmark frameworks
Metric Store by Ben Hababo and Mickey Rozen
a data lake is a highly common pool;
Monte Carlo algorithms and metrics are widely used;
for such questions, use an AI solution such as Metaphor
New terms: Looks tool, Data Pulse, data democratization
Learning the Ropes of Synthetic Data by Noa Zamstein
an image is a collection of pixels in rows and columns (5 values): X, Y, R, G, B;
synthetic data can be created to represent real data cases, especially in privacy-sensitive areas. Synthetic data references the original data, as if created from it, but without any (privacy) connections;
combinations of data: only one male in that cab, or one unique name, so there is no need to specify; it can easily be defined by a few parameters;
data minimization while analyzing: take only the fields relevant to the case;
data selection as a hierarchy (e.g., not 'heart attack' but 'heart syndrome', more generic);
correlation metrics between real and synthetic data;
synthetic data is a chance to create non-existent things (like trees on empty roads, etc.)
New terms: catalytic practices, article
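The "correlation metrics" note above can be made concrete by comparing the column-correlation matrices of a real and a synthetic table; the columns and numbers below are invented for illustration, not from the talk.

```python
import numpy as np

# Invented example: two columns (e.g. height, weight) in real vs synthetic data
real = np.array([[170, 65], [180, 80], [160, 55], [175, 72]])
synthetic = np.array([[172, 66], [178, 79], [162, 57], [174, 70]])

real_corr = np.corrcoef(real, rowvar=False)
synth_corr = np.corrcoef(synthetic, rowvar=False)
gap = np.abs(real_corr - synth_corr).max()  # small gap = correlations preserved
```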
LLM and Knowledge Graphs: A case study in Blue(y) by Stav Shamir
a trained model returns vectors; our requests are compared to the trained vectors, and the closest vectors are the outputs for the request;
vector embedding turns words into numbers;
graphs with connections become complicated neural networks;
schemas should not be too complex (or you may not receive any answer) nor too easy (or you would not need an LLM for that);
representing LLM connections as graphs helps to see them as a full picture
Live in Data Wild West: the data contracts sheriff by Tal Peretz
data issues bring real leaks and legal situations;
data quality is the most important thing to care about;
New terms: data contracts, DataHub (reviewer) , AvroScheme, source of truth, fields deprecations (silent vs …), scheme evolution, schema.yml, DBT expectation (tests on DB), assertion, elementary
Navigating the Uncharted: Ensuring Prompt Quality in the Age of Language Models by Ortal Ashkenazi
Several way to ensure prompt quality:
manually: efficient, but not fast;
LLM-as-a-judge: fast, but depends on the prompt and LLM version;
leveraging an NLM for prompt verification, entity recognition:
cons: there are no models for everything;
automate metrics validation;
unchecked prompts may contain hidden dangers
New terms: NLP, NLM
Exploring the Depths of Apache Iceberg's Metadata Capabilities by Amit Gilad
there are other modern, cloud solutions to store data (data lakes) instead of backup disks
Threat Hunting Powered by Efficient and Straightforward anomaly detection on your data lake by Ori Nakar
anomalies are things that deviate from what is standard, normal, or expected, but they should be defined per specific case;
running queries with JSON;
SQL is still the fastest way to detect anomalies;
an LLM with RAG (SQL statement) runs text queries without syntax dependence.
New terms: data scanning, CVE, DDoS
Dating with a super model: why good prompt engineering for data monitoring requires some flirting by Reut Vilek
no long prompts; break them into a few small ones;
take it step by step;
give references, examples of sites, images, etc.;
parse and analyze results to regenerate them in a better way
New terms: FML, NoCodeTool
Excited to try these discoveries in my daily work!
After a few postponements, PyData finally took place in Tel Aviv on November 4th, 2024.
I would like to leave here my notes per speech I attended:
The Dangerous Data Anonymization by Ran Bar Zik
Anonymization means taking data and removing all private details
There are different anonymization techniques:
masking;
dynamic masking;
aggregation (age as range, general names);
pseudonymisation or reversible anonymization;
differential privacy - add noise, mix data;
more privacy means less accuracy;
"You are holding data, think before releasing it!"
New terms: PII masking, AnonyPyx Python libs; HIPAA, CCPA
Unveiling the Journey of Natural Language Processing (NLP): Milestones, Limitations, and Practical Applications by Ortal Ashkenazi
MLM - masked language model;
transfer learning;
learning types: zero shot, one shot, few shots;
LLMs are limited to static data (if there are no slang words in the training data, the model doesn't understand requests with those words);
RAG overcomes knowledge limitations;
Multimodal image integration
New terms: NLIK
A Shallow Introduction to Self-Attention by Alon Oring
Naive self-attention: how similar embeddings are (as word embeddings), scoring one against the others;
Recurrent Neural Network;
Contextualized embeddings
New terms: RNN, QKV
Securing Language Models Against Prompt Injection with the Powerful LangChain Framework by Michael Ethan Levinger
security through adversarial testing (the LLM learns bad things);
Rebuff - a moderation endpoint for detecting and managing harmful content;
DAN stands for "do anything now" (in the prompt sense);
indirect prompt injection while loading from an external source: docs, CVs;
if you use 'labels' when asking the model, it interacts better (IBM models example)
New terms: Lakera
Ibis framework - Making data science work at any scale by Omri Fima
https://github.com/thegreymatter/ibisframeworks - allows working with different data sources in one place (for instance, joining a CSV with a table in a database);
pip install ibis;
ibis.con.compile - describes a query;
ibis udf
ibis.udf.scalar.pyarrow ->tokens
SQL[CRUD]-to-Text
I didn't realize that "SQL-to-Text" models can only translate SELECT statements!
Until I tried to explain a complicated INSERT query that a user needed to confirm inside an annotation tool... and the models (HuggingFaceH4's zephyr-7b-beta, Google's T5, and even the fine-tuned mrm8488/t5-base-finetuned-wikiSQL) just answered: "What...?" (as in "what to SELECT?").
That's when I discovered that these models don't even understand the basic SQL statements:
Create (Insert) → add new records into a table;
Read (Select) → read existing records;
Update → change existing records;
Delete → remove records.
To bridge this gap, I built a prompt-based solution per SQL statement. Each prompt tells the model what the statement really means, and then guides token generation - so the final output isn’t just a translation of syntax, but a clear explanation of what that specific query will do.
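The per-statement prompting idea can be sketched as below; the templates and the statement-detection logic are illustrative assumptions, not the exact implementation behind the linked demo.

```python
# Illustrative prompt templates, one per CRUD statement
CRUD_PROMPTS = {
    "INSERT": "This query ADDS new records. Explain in plain English what will be added: {query}",
    "SELECT": "This query READS records. Explain in plain English what will be returned: {query}",
    "UPDATE": "This query CHANGES existing records. Explain in plain English what will change: {query}",
    "DELETE": "This query REMOVES records. Explain in plain English what will be removed: {query}",
}

def build_prompt(query):
    """Pick the template that tells the model what the statement really means."""
    statement = query.strip().split()[0].upper()
    template = CRUD_PROMPTS.get(statement)
    return template.format(query=query) if template else None

prompt = build_prompt("INSERT INTO fruits (name) VALUES ('orange')")
```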
This way, annotators and users with different SQL levels can approve (or reject) queries in plain English, and all teams align faster on project details without getting stuck in syntax.
(CRUD)SQL-to-Text turns the most complex queries into very simple insights and lets us spend less time explaining SQL and more time deciding!
Try CRUD-SQL2Text workaround in action: https://lnkd.in/dQXbhftY