Czech Handwritten Text Recognition: A Real Possibility

The main thing that I got out of RootsTech 2018 is the permission to think about Czech Handwritten Text Recognition as a real possibility. I have always been a little bit prone to being overly ambitious and having ridiculously huge dreams. A lot of people have derisively laughed at my vision in the past, and honestly it becomes really stiflingly lonely sometimes. The first presentation I went to was about big data and how machine learning can aggregate and analyze multiple datasets in different ways so that *poof* we can recreate the 1890 census, for example. After hearing that as a possibility - and not some far, distant possibility, but a very real one - I felt like I had permission to dream big.

This is about as big as I can dream:

I went to BYU Family History Technology Lab's presentation (which you can see here). My husband took classes from all of those professors when he was getting his BS in bioinformatics, by the way. Most of what the lab is doing is lowering the threshold of entry into family history by gamifying it, which was kind of cool. And I really do agree that Relative Finder and FamilySearch's very similar new "relatives near me" feature are the "gateway drug" into family history.

But the really great part of their presentation had nothing to do with a little character walking around the archive and the cemetery gathering records, or solving crossword puzzles or word searches, or playing knock-offs of Jeopardy and Wheel of Fortune. No, the best part by far was what they are doing with handwritten text recognition, and this is not even featured on their site yet.

I'm not going to lie, I love transcriptions of old text. In the past when I've thought about HTR for my complicated Czech and German texts, I've always scoffed and snobbily thought that there's no way the computer would ever be better than me. You should note that when I index records with FamilySearch, I consistently get 70% or lower on the level 1 records (which is abysmal), and 96-99% on the level 5 records, which is somewhat hilarious. I'm pretty good at the squiggles. People pay me to do it and the truth is I really enjoy it. I think of all the mistakes made with OCR and laugh at the idea of handwriting recognition software being a possibility. "Yeah, in 20 years maybe."

Then I saw this slide and it changed my life.

(sorry, hopefully the copyright police won't hate me for sharing this. But it just might change your life, too.)

So, the BYU CS major students were in a competition with a terribly long name called ICDAR 2017 Handwritten Text Recognition, and apparently their tool was able to do this perfect automated transcription. BYU's median character error rate (7%) was significantly lower than the next closest competitor (25%). Its word error rate was 16.8%, and the next closest competitor was 41.4%.
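
For anyone wondering what those two numbers measure, here is a minimal sketch of the usual definitions: count the smallest number of insertions, deletions, and substitutions needed to turn the machine's output into the correct transcription, then divide by the length of the correct transcription (in characters for the character error rate, in words for the word error rate). I don't know exactly how the competition scored its entries, so treat this as the textbook version. The function names and the "machine output" below are made up for illustration.

```python
import unicodedata

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def char_error_rate(reference, hypothesis):
    # Normalize so that an accented letter always counts as one character.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    return edit_distance(ref, hyp) / len(ref)

def word_error_rate(reference, hypothesis):
    ref_words = unicodedata.normalize("NFC", reference).split()
    hyp_words = unicodedata.normalize("NFC", hypothesis).split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

# Hypothetical example: the "correct" text and a pretend machine output
# that got a single letter wrong.
reference = "Poněvadž já chudobná vdova"
hypothesis = "Poněwadž já chudobná vdova"
print(char_error_rate(reference, hypothesis))  # 1 wrong character out of 26
print(word_error_rate(reference, hypothesis))  # 1 wrong word out of 4
```

One wrong letter spoils a whole word, which is why the word error rate is always so much higher than the character error rate.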

What the heck ICDAR stands for, I have not been able to figure out; computer scientists and scientists in general very commonly make this same mistake of writing in something that barely counts as English. From what I could understand, participants in this international competition used the same data set - the READ Dataset, which seems like it is not available for public use - to train a really complex computer algorithm to accurately recognize the handwriting. Machine learning.

Something like y = x(w) + b, where y is the thing you want, x is the input, b is the bias (a baseline offset, sort of like an average), and w is the secret sauce.

My very, very cursory understanding is that you take this model, but then layer it upon itself and that is an extremely simplistic way of describing a neural network.

Something like y = x(w) + b, and then you feed the y into another layer as the input over and over again, until finally you get the most beautiful, accurate, perfect final y imaginable.
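
Here is a minimal sketch of that layering idea in plain Python, with made-up numbers. One thing it adds that the description above leaves out: real networks put a small nonlinear "bend" between the layers (I use one called ReLU here, as an assumption), because stacking y = x(w) + b on itself without it would collapse back into a single y = x(w) + b. The function names are hypothetical, and a real HTR network is vastly bigger and has its w's learned from training data rather than typed in by hand.

```python
def layer(x, w, b):
    """One layer: y = x(w) + b, for a list of inputs and a table of weights."""
    # w[i][j] says how much input i contributes to output j.
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def relu(values):
    """Assumption: the bend applied between layers (negative values become 0)."""
    return [max(0.0, v) for v in values]

def network(x, layers):
    """Feed the y of each layer in as the x of the next one."""
    for index, (w, b) in enumerate(layers):
        x = layer(x, w, b)
        if index < len(layers) - 1:  # no bend after the final layer
            x = relu(x)
    return x

# Made-up weights and biases, purely for illustration.
layers = [
    ([[1.0, -1.0], [0.5, 2.0]], [0.0, 1.0]),  # 2 inputs -> 2 outputs
    ([[2.0], [1.0]], [0.5]),                  # 2 inputs -> 1 output
]
print(network([1.0, 2.0], layers))
```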

I guess the math itself isn't that hard, it's trying to figure out how to set up the dang problem. Trying to discover what are the relevant w's (because, by the way, there can be like hundreds and hundreds of w's).

Anyway, I'll stop pretending like I understand machine learning now. I really don't, and would have been a pitiful CS major. Something like: (my desire to learn and create interesting things)* (my interest in French, my interest in ASL, my interest in Arabic, my interest in biology, my interest in Czech, the lack of availability of Czech, my interest in cute boys who like to shower, my interest in being outside in the sun, my dislike of computer lab dungeons) + (my average capacity of solving problems in an averagely straightforward way i.e. not like a programmer, who typically cheats - oh sorry, I guess that's called a "hack") = a decision of a major between French Teaching, Middle Eastern Studies and Arabic, Photography, and Computer Science.

(a decision of a major between French Teaching, MESA, Photography, and Computer Science) * (the inevitability that I will probably not actually use my degree for gainful employment, the desire to interact with interesting humans, my love of cracking people who are geeky and difficult to crack, my desire to do something meaningful with my time, the weirdness aka coolness factor of Arabic, my utter hatred of the idea of using photography for gainful employment, the vague possibility that maybe someday I could be a spy, the desire to prove people wrong who laughed at me for thinking I could learn Arabic, the fact that French was super easy and I didn't have to try very hard to get straight A's and I still remember the one mistake I made on the pronunciation test, the fact that I'm actually really good at oral/aural language learning) + (my average capacity of solving problems in an averagely straightforward way i.e. I am really, really, really good at talking but not that good at solving complicated math problems) = a decision to be a MESA major.

A decision to be a MESA major * etc. etc.


Back to the blog post...

I was really, really excited to learn about BYU FHTL's handwriting recognition project. Afterward I went to talk to them about what it would take for Czech HTR (handwritten text recognition) to become a reality. Later, I learned that these CS majors are set apart as church service missionaries (because that's as close as FamilySearch can get to employing them, haha) and they are working on this technology for English and Spanish FamilySearch indexing projects.

There are always going to be errors in HTR, but don't you see that this would change everything? It really would change the way we do genealogy, possibly even more than DNA! Suddenly texts which take hours and hours to read and transcribe would be indexed and findable in seconds! I get shivers thinking about this because most of my time doing genealogy is spent staring at these squiggly lines, trying to decipher them into meaning. I love it, and I get paid to do it - but I'm SO EXCITED about this, I've basically thought of nothing else since I went to this presentation.

Oh, don't worry, Czech is still going to be really difficult, even for HTR.

Of course it is. Czech has a complexity complex; it can't ever just be simple 🙂 For example, this.

One huge problem is that there is currently no information published in English about transcription and transliteration standards for old Czech documents, or if there is, I have not found it yet. This is an entire field of study at universities in the Czech Republic, but I am unsure whether they even have universal standards across the country. Communism probably stunted the field's growth, and it certainly is the reason reliable texts on this subject do not yet exist in English.
Reliable texts on most things related to Czech history do not exist in English, and it sucks so much. Bleaurgh.
One of the first steps is to find out if transcription standards exist, and if so what they are.
Any useful HTR tool for Czech really needs to incorporate a secondary tool which will transform the wysiwyg transcription into a transliteration with modern standardized Czech spelling. Czech spelling was not standardized until the late 19th/early 20th century, around the time when Jan Otto's encyclopedia was published. It's not just a thorn th or a dagger y every once in a while; it is literally almost every high-frequency letter which must be transformed in order to even hope to understand the meaning. And who cares about HTR if it doesn't help you get closer to unlocking the meaning of the text?

ss and sch to š
cz and čz and cž to č
rz and řz and rž to ř
j to í and i
ie to ě
au to ou
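
To make the idea of a "secondary tool" a bit more concrete, here is a very naive sketch that applies the substitutions listed above to a single word. The names `RULES` and `modernize` are hypothetical. I left the j rule out on purpose: the code has no way of knowing whether a given j should become í or i, so that one would need a dictionary or a human. Blind find-and-replace like this will also get plenty of words wrong; it only shows the shape of the problem.

```python
import re

# (old spelling, modern spelling) pairs taken from the list above.
# Longer patterns come first so that "sch" is tried before "ss".
# Assumption: the "j to í and i" rule is omitted because it is ambiguous.
RULES = [
    ("sch", "š"), ("ss", "š"),
    ("čz", "č"), ("cž", "č"), ("cz", "č"),
    ("řz", "ř"), ("rž", "ř"), ("rz", "ř"),
    ("ie", "ě"),
    ("au", "ou"),
]

def modernize(word):
    """Apply each rule in order, keeping a capital letter where one was found."""
    for old, new in RULES:
        word = re.sub(
            old,
            lambda m: new.capitalize() if m.group(0)[0].isupper() else new,
            word,
            flags=re.IGNORECASE,
        )
    return word

for old_spelling in ["Ročznie", "Wassicžek", "sediel"]:
    print(old_spelling, "->", modernize(old_spelling))
```

Notice that "Wassicžek" still comes out with a W, because w to v is not in my list above. Every rule you forget is a word the translator cannot read.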
I will illustrate what I mean. Here is a 1676 record from the Estate of Hluboká nad Vltavou, specifically from the village of Chlumec which today is in the České Budějovice District, South Bohemia Region.
An exact wysiwyg transcription would be like this [I have bolded the non-standardized spelling]:
Leta Panie " 1676: Ajal on Tento
grunt w summie za....275
Ma Ročznie platitj po...5 ßm[?].
This is not modern Czech. Modern Czech would be:
Léta paně " 1676: Uja on tento
grunt v sumě za...275
Ma ročně platiti po...5 ßm[? again I'm not sure what this last symbol is; it looks like the kind of squiggly line they sometimes use to represent an abbreviated word. This has no equivalent symbol in English or Czech and a possible standard for transcription seems to be that "." is used to represent it, but I am really not sure about this.]
From the opposite end of the country in Vratimov, near Ostrava, here is a land record for my ancestor Jiří Vašíček. It is probably a copy, made ca. 1765, of a 1742 contract.
Wysiwyg, bolded the nonstandardized spelling:
Leta Panie 1742. dnie 2ho february oukopjl
jest Jiržj Wassicžek Rolu v diedinie Rattimovie
lezicžj, na ktere prvotnie Mjka Ssadek sediel
od vrchnosti za Koupni Summu 83 Th Sleu[that weird symbol for abbreviations]
Modern Czech spelling:
Léta Páně 1742 dne 2ho february oukoupil
jest Jiří Vašíček rolu v dědině [V]Ratimově
ležící, na které prvotně Míka Šodek seděl
od vrchnosti za kupní sumu 83 thl sle[zské]
Translation:
In the year of our Lord 1742, [on] the 2nd day of February[,]
Jiří Vašíček bought from the manorial lord an arable piece of land
in the hamlet of Vratimov [where] initially sat Míka Šodek,
for a purchase price of 83 Silesian thalers.
Fortunately this becomes less of an issue in the late 19th century. Here's an 1879 application for a passport from the District office of Místek which I photographed 3 months ago in SOkA Frýdek-Místek.

wysiwyg transcription:
Má matka v Americe, jménem Anna
Steffek, vyplatila mému synovi cestu
do Ameriky a žádá by muj syn Robert
Srkala za ní přišel a že mu tam do”
konalé postavení zappatří.
Poněwadž já chudobná vdova
jsem a synovi to dáti nemohu co
můj bratr a má matka v Americe,
tedy já prosim:
Notice that the word "poněwadž" is spelled "poněvadž" today.
By the way, here is a translation:
My mother in America, named Anna
Steffek, paid for the journey to America for my son
and requested that my son Robert
Srkal come to her and that a perfect
position will belong to him there.
Because I am a poor widow
and I cannot give to my son
what my brother and my mother in America [can give],
therefore I ask:
and it goes on to request the district office give him a passport to travel to America.
Native Czechs struggle to read the wysiwyg transcription the same way as native English speakers would struggle to make sense of Ye Olde English - perhaps even slightly more, since nearly every high-frequency phoneme is spelled differently in Ye Olde Czech. Non-Czech speakers will struggle a lot with just a wysiwyg transcription because you simply cannot put that output into any kind of online translator and have it make any kind of sense at all. Like, literally, at all.
The good news is that ~half of all historic Czech records were written in German, or some kind of combination of Czech/German. Remember, until 1918 the Czech lands were part of the Austrian (1804-1867) and Austro-Hungarian (1867-1918) Empires, so the administrative language was German. These records do not have this same transliteration problem because German spelling was standardized much earlier.
After a lot of exploring on this website I finally unburied this other website called Transkribus. I spent all afternoon trying to figure it out, and it was a pretty steep learning curve, I have to say. But basically, this is the exact software I need to use to make a Czech transcription training set a reality. Actually, the folks at Transkribus do that part for you.
- You create an account.
- You download the software and log in.
- You upload your documents to the Transkribus server.
- You determine the segmentation of your document, although it's pretty fast with their automated tool.
- You transcribe 50-75 documents.
- You submit your documents to Transkribus. They will use your transcriptions to create a training model for HTR of similar texts.
- You can now use the trained model to transcribe other texts.
By the way, here's what a Czech document looks like when transcribed with a German Kurrent training model already available on Transkribus. Try not to laugh too hard:

Here's an example of useless HTR.

Why did John Huff from FamilySearch say it would take 10,000 NEARLY PERFECT Czech transcriptions to create a usable training set for this project? I am not exactly sure where that number comes from, but I believe that he's probably right.
This is a really complicated project with a super steep learning curve. I am trying to figure some of the logistics of it out before I start calling out to the world to try to help me accomplish this enormous task. Because this is never going to happen without crowd-sourcing the load. I mean, if I wrote one perfect transcription every day except Sunday, it would take me 31+ years to create a training set for Czech HTR, and I'd be 62 years old. The only way to get this done is to collaborate. I will continue to post about this as I learn more. Like I said, it's nearly all I can think about. I feel pretty annoyed right now that I didn't become a CS major, to be honest.
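
For the curious, here is the back-of-the-envelope math behind that 31+ years, and what happens when the load is shared. The 10,000 figure is the one quoted above; the function name and the volunteer count of 50 are made up for illustration.

```python
def years_needed(transcriptions, volunteers=1, per_day=1, days_per_year=313):
    """Years to finish, working every day except Sunday (365 - 52 = 313 days)."""
    return transcriptions / (volunteers * per_day * days_per_year)

# Me, alone, one perfect transcription a day.
print(round(years_needed(10_000), 1))

# Hypothetical: 50 volunteers at the same pace.
print(round(years_needed(10_000, volunteers=50), 2))
```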

2 thoughts on “Czech Handwritten Text Recognition: A Real Possibility”

  • Hi Kate,

    I am not sure those 10k Czech transcriptions would be necessary (or at least not in that shape and form). The way new machine learning algorithms work is that you take an already finished network (e.g., the German one you talked about) and then you “retrain” it (it’s kind of complicated). Initial retraining can be done by the computer itself; there are programs that can generate handwritten text (there are fonts from different time periods). Once you partially retrain the program (network) on these texts (and distort them), I think you already get a good product. At that point, human work will come in handy. You let the network read your text, and correct it. I don’t think that many texts would be necessary.

    This has actually been a pet project of mine for years (ok, I do something and in a few months it is obsolete because of the advances in AI). Nonetheless, I have thought a lot about it.

    • I would love nothing more than for you to be right about that! Meanwhile, I am trying to brainstorm some ideas about how to accomplish some sort of crowdsourced transcription project – be it 1,000 or 10,000, it would be a lot nicer to not work on it alone. I am quite sure that my capacity to invent new projects cannot keep up with my capacity to actually finish them, and that is definitely doubly or triply true when I’m pregnant and my energy level is much lower than normal. Basically, I think I have to finish some of my other projects before really committing to exploring this idea further…but it’s killing me…this would change EVERYTHING…
