About this site
To be a Reader, requires vnderstanding; to be a Critike, iudgement. A Dictionarie giues armes to that, and takes no harme of this, if it mistake not. I wish thee both, but feare neither, for I still rest
The start of an idea
Greg Lindahl has provided a tremendous service with his hosted images and PDF of Florio's dictionary. However, even though his site provides links to the major sections, links to navigate between pages, and links to different sizes of the images, it can take some time to look up each word of a sentence someone may want to translate. And because languages evolve over time, online translation tools often didn't quite do the job either.
I had been aware of progress on machine-learning translation models, in particular surprising work like Multi-Source Cross-Lingual Model Transfer: Learning What to Share (Chen et al., ACL 2019), where further training a translation model on one language yields improvements for other languages without much training data, even those in unrelated language families. I had an idea to leverage this to more easily create a translation model that could understand Early Modern Italian; however, I would need text to feed it. I did in fact start transcribing the scans hosted by Greg Lindahl, and got through about 2,400 words.
Transcribing is a long process, which wasn't at all helped by curiosity about old words or phrases that have long since fallen out of common use. For example, fee-farme (meaning a land or dwelling being rented), or neale red whot, which, after some searching, appears to mean "annealed red hot". As luck would have it, one of the few places that latter phrase appears online is in Project Gutenberg's transcription of Florio's dictionary. 🙌🙌🙌
The Gutenberg transcription certainly made the path easier. It was also around this time that Retrieval Augmented Generation (RAG) methods for Large Language Models (LLMs) were being created and popularised. I won't go too deep into what this is, aside from saying it is a way to allow LLMs like ChatGPT to access data and functionality outside of their training data. One of the common ways to provide data that an LLM (or any program) can quickly search for related text is a vector database.
I realised I could skip machine learning entirely by loading the definitions into a vector database, and then using that database for lookups. Even better, with a little work to "normalise" the contents (such as stripping accents/diacritics and punctuation), lookups should still be easy and very accurate, even with variations on spelling.
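As a rough illustration of what that normalisation involves: in .NET, accents and diacritics can be stripped by decomposing a string into its combining marks and dropping them. The helper below is just a sketch of that idea, not the site's actual code.

```csharp
using System.Globalization;
using System.Text;

static class Normaliser
{
    // Sketch only: lower-case the word, decompose accented characters (FormD),
    // drop the combining marks and punctuation, then recompose the rest.
    public static string Normalise(string word)
    {
        var decomposed = word.ToLowerInvariant().Normalize(NormalizationForm.FormD);

        var builder = new StringBuilder(decomposed.Length);
        foreach (var c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue; // the accent/diacritic itself
            if (char.IsPunctuation(c))
                continue; // brackets, apostrophes, etc.
            builder.Append(c);
        }

        return builder.ToString().Normalize(NormalizationForm.FormC);
    }
}

// e.g. Normaliser.Normalise("Stellẻ[o]") == "stelleo"
```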
How I built the site
The first step was getting the data. This meant writing some code to parse the Gutenberg transcription and distinguish the Italian words from their definitions. Unfortunately, people in the 17th century didn't give any thought to machine-readable formatting, even if Gutenberg provides both an HTML and a plain-text (which looks similar to Markdown) format. After a couple of months, several iterations each with some minor reworking of my parsing logic, and a handful of errata submissions, I had a program that could read the entries, identify the Italian words and phrases (including those that appear within the definitions), handle Florio's way of grouping conjugations into a single definition (e.g. Asseguíre, guísco, guíto becomes Asseguíre, Asseguísco, and Asseguíto), and even write it all out to an Excel spreadsheet. Run the Florio.Gutenberg.Exporter.Excel project in the GitHub repository if you'd like to generate one yourself and have a look at the results.
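To give a feel for how those grouped conjugations can be expanded, the trick is to find where the abbreviated form's first letters overlap with the headword and splice the two together at that point. The snippet below is an illustrative heuristic, not the repository's exact parsing logic.

```csharp
using System;

static class ConjugationExpander
{
    // Sketch only: expand an abbreviated form such as "guísco" against the
    // headword "Asseguíre" by finding the longest prefix of the abbreviation
    // that appears in the headword, and keeping everything before the match.
    public static string Expand(string headword, string abbreviated)
    {
        for (var length = abbreviated.Length; length > 0; length--)
        {
            var prefix = abbreviated.Substring(0, length);
            var index = headword.LastIndexOf(prefix, StringComparison.OrdinalIgnoreCase);
            if (index > 0)
                return headword.Substring(0, index) + abbreviated;
        }

        // No overlap found; treat the abbreviation as a word in its own right.
        return abbreviated;
    }
}

// Expand("Asseguíre", "guísco") => "Asseguísco"
// Expand("Asseguíre", "guíto")  => "Asseguíto"
```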
Now that I have code that can parse the Italian words and English definitions, what next? The goal is to put it all in a vector database,
so I need to generate vector embeddings for all of the Italian words. Each word is normalised (diacritics and transcription marks removed, and converted to lower-case)
and then split into unigrams and bigrams (each vector consisting of groups of one and two letters). This keeps the vector space small, and tolerates misspellings or spelling variants.
Using this, ML.NET can train a model that can be exported to an ONNX file (so I don't have to retrain it every time the application starts),
and can be used to generate the vectors required for populating the vector database.
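To make that a little more concrete, a featurisation pipeline along these lines can be built from ML.NET's text transforms. The column names, sample data, and exact parameters below are illustrative rather than the code in the repository.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

var mlContext = new MLContext();

// A tiny sample; the real input is every headword parsed from the transcription.
var words = new List<WordInput> { new() { Word = "Stellẻo" }, new() { Word = "Asseguíre" } };
var data = mlContext.Data.LoadFromEnumerable(words);

var pipeline =
    // Lower-case and strip diacritics and punctuation.
    mlContext.Transforms.Text.NormalizeText(
        outputColumnName: "Normalized",
        inputColumnName: nameof(WordInput.Word),
        caseMode: TextNormalizingEstimator.CaseMode.Lower,
        keepDiacritics: false,
        keepPunctuations: false,
        keepNumbers: false)
    // Break the normalised word into individual characters...
    .Append(mlContext.Transforms.Text.TokenizeIntoCharactersAsKeys(
        outputColumnName: "Chars",
        inputColumnName: "Normalized"))
    // ...then count character unigrams and bigrams to form the vector.
    .Append(mlContext.Transforms.Text.ProduceNgrams(
        outputColumnName: "Vector",
        inputColumnName: "Chars",
        ngramLength: 2,
        useAllLengths: true));

var model = pipeline.Fit(data);

// Export to ONNX (via the Microsoft.ML.OnnxConverter package) so the vectoriser
// doesn't have to be retrained every time the application starts.
using var stream = File.Create("florio-vectoriser.onnx");
mlContext.Model.ConvertToOnnx(model, data, stream);

public class WordInput
{
    public string Word { get; set; } = "";
}
```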
For example, Stellẻ[o] (where [o] is a notation used by the transcribers to aid pronunciation) first becomes stelleo after being normalised. This is then split into the unigrams and bigrams [s], [t], [e], [l], [l], [e], [o], [s, t], [t, e], [e, l], [l, l], [l, e], [e, o], [o, ]. Searching the vector database easily locates this, even if it is slightly misspelled (e.g. steleo; missing an l).
With all this, I can finally populate a vector database. I started with Qdrant (pronounced "quadrant", I think?),
but have switched over to Azure CosmosDb with its recently added vector support, as it provides a (much cheaper) serverless option and the ability to query using SQL.
However, other vector databases could easily be used instead (and I even use the VolatileMemoryStore from the Semantic Kernel library in one of the test apps).
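For the curious, a lookup against Cosmos DB ends up looking something like the sketch below. The container layout, property names (c.word, c.definition, c.vector), and the DefinitionMatch type are invented for illustration; the real query shape lives in the repository.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public record DefinitionMatch(string Word, string Definition, double Score);

public static class FlorioSearch
{
    // Sketch only: find the definitions whose stored vectors are closest to the
    // vector generated for the (normalised) search word.
    public static async Task<List<DefinitionMatch>> SearchAsync(
        Container container, float[] queryVector, int limit = 10)
    {
        var query = new QueryDefinition(
                @"SELECT TOP @limit c.word AS Word, c.definition AS Definition,
                         VectorDistance(c.vector, @embedding) AS Score
                  FROM c
                  ORDER BY VectorDistance(c.vector, @embedding)")
            .WithParameter("@limit", limit)
            .WithParameter("@embedding", queryVector);

        var results = new List<DefinitionMatch>();
        var iterator = container.GetItemQueryIterator<DefinitionMatch>(query);
        while (iterator.HasMoreResults)
        {
            foreach (var match in await iterator.ReadNextAsync())
                results.Add(match);
        }

        return results;
    }
}
```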
As I'm sure you've seen by trying a search, this has worked a spectacular treat.
Finíto.