The main thing that I got out of RootsTech 2018 is the permission to think about Czech Handwritten Text Recognition as a real possibility. I have always been a little bit prone to being overly ambitious and having ridiculously huge dreams. A lot of people have derisively laughed at my vision in the past, and honestly it becomes really stiflingly lonely sometimes. The first presentation I went to was about big data and how machine learning can aggregate and analyze multiple datasets in different ways so that *poof* we can recreate the 1890 census, for example. After hearing that as a possibility – and not some far, distant possibility, but a very real one, I felt like I had permission to dream big.
This is about as big as I can dream:
I went to BYU Family History Technology Lab‘s presentation (which you can see here). My husband took classes from all of those professors when he was getting his BS in bionformatics, by the way. Most of what the lab is doing is lowering the threshold of entry into family history by gamifying it, which was kind of cool. And I really do agree that Relative Finder and FamilySearch’s very similar new “relatives near me” feature are the “gateway drug” into family history.
But the really great part of their presentation had nothing to do with a little character walking around the archive and the cemetery gathering records, or solving crossword puzzles or word searches, or playing knock-offs of Jeopardy and Wheel of Fortune. No, the best part by far was what they are doing with handwritten text recognition, and this is not even featured on their site yet.
I’m not going to lie, I love transcriptions of old text. In the past when I’ve thought about HTR for my complicated Czech and German texts, I’ve always scoffed and snobbily thought that there’s no way the computer would ever be better than me. You should note that when I index records with FamilySearch, I consistently get 70% or lower on the level 1 records (which is abysmal), and 96-99% on the level 5 records, which is somewhat hilarious. I’m pretty good at the squiggles. People pay me to do it and the truth is I really enjoy it. I think of all the mistakes made with OCR and laugh at the idea of handwriting recognition software being a possibility. “Yeah, in 20 years maybe.”
Then I saw this slide and it changed my life.
(sorry, hopefully the copyright police won’t hate me for sharing this. But it just might change your life, too.)
So, the BYU CS major students were in a competition with a terribly long name called ICDAR 2017 Handwritten Text Recognition, and apparently their tool was able to do this perfect automated transcription. BYU’s median character error rate (7%) was significantly lower than the next closest competitor (25%). Its word error rate was 16.8%, and the next closest competitor was 41.4%.
What the heck ICDAR stands for, I have not been able to figure out; computer scientists and scientists in general very commonly make this same mistake of writing in something that barely counts as English. From what I could understand, participants in this international competition used the same data set – the READ Dataset, which seems like it is not available for public use – to train a really complex computer algorithm to accurately recognize the handwriting. Machine learning.
Something like y = x(w) + b, where y is the thing you want, x is the input, b is the average, and w is the secret sauce.
My very, very cursory understanding is that you take this model, but then layer it upon itself and that is an extremely simplistic way of describing a neural network.
Something like y = x(w) + b, and then you feed the y into another layer as the input over and over again, until finally you get the most beautiful, accurate, perfect final y imaginable.
I guess the math itself isn’t that hard, it’s trying to figure out how to set up the dang problem. Trying to discover what are the relevant w’s (because, by the way, there can be like hundreds and hundreds of w’s).
Anyway, I’ll stop pretending like I understand machine learning now. I really don’t, and would have been a pitiful CS major. Something like: (my desire to learn and create interesting things)* (my interest in French, my interest in ASL, my interest in Arabic, my interest in biology, my interest in Czech, the lack of availability of Czech, my interest in cute boys who like to shower, my interest in being outside in the sun, my dislike of computer lab dungeons) + (my average capacity of solving problems in an averagely straightforward way i.e. not like a programmer, who typically cheats – oh sorry, I guess that’s called a “hack”) = a decision of a major between French Teaching, Middle Eastern Studies and Arabic, Photography, and Computer Science.
(a decision of a major between French Teaching, MESA, Photography, and Computer Science) * (the inevitability that I will probably not actually use my degree for gainful employment, the desire to interact with interesting humans, my love of cracking people who are geeky and difficult to crack, my desire to do something meaningful with my time, the weirdness aka coolness factor of Arabic, my utter hatred of the idea of using photography for gainful employment, the vague possibility that maybe someday I could be a spy, the desire to prove people wrong who laughed at me for thinking I could learn Arabic, the fact that French was super easy and I didn’t have to try very hard to get straight A’s and I still remember the one mistake I made on the pronunciation test, the fact that I’m actually really good at oral/aural language learning) + (my average capacity of solving problems in an averagely straightforward way i.e. I am really, really, really good at talking but not that good at solving complicated math problems) = a decision to be a MESA major.
A decision to be a MESA major * etc. etc.
~~~~~
Back to the blog post…
I was really, really excited to learn about BYU FHTL’s handwriting recognition project. Afterward I went to talk to them about what it would take for Czech HTR (handwritten text recognition) to become a reality. Later, I learned that these CS majors are set apart as church service missionaries (because that’s as close as FamilySearch can get to employing them, haha) and they are working on this technology for English and Spanish FamilySearch indexing projects.
There’s always going to be errors in HTR, but don’t you see that this would change everything? It really would change the way we do genealogy, possibly even more than DNA! Suddenly texts which take hours and hours to read and transcribe would be indexed and findable in seconds! I get shivers thinking about this because most of my time doing genealogy is spent staring at these squiggly lines, trying to decipher them into meaning. I love it, and I get paid to do it – but I’m SO EXCITED about this, I’ve basically thought of nothing else since I went to this presentation.
Oh, don’t worry, Czech is still going to be really difficult, even for HTR.
Of course it is. Czech has a complexity complex; it can’t ever just be simple 🙂 For example, this.
An exact wysiwyg transcription would be like this [I have bolded the non-standardized spelling]:
jest Jiří Vašíček rolu v dědině [V]Ratimově
ležící, na které prvotně Míka Šodek seděl
Hi Kate,
I am not sure those 10k Czech transcriptions would be necessary (or at least in that shape and form). How new machine learning algorithms work is that you take already finished network (e.g., the German one you talked about) and then you “retrain” it (it’s kind of complicated). Initial retraining can be done by computer itself; there are programs that can generate hand-written text (there are fonts from different time periods). Once you partially retrain the program (network) by these texts (and distort them), I think you already get a good product. At that point, human work will come handy. You let the network read your text, and correct it. I don’t think that many text would be necessary.
This is actually a pet project of mine for years (ok, I do something and in few months it is obsolete because of the advances in AI). Nonetheless, I have though a lot about it.
I would love nothing more than for you to be right about that! Meanwhile, I am trying to brainstorm some ideas about how to accomplish some sort of crowdsourced transcription project – be it 1,000 or 10,000, it would be a lot nicer to not work on it alone. I am quite sure that my capacity to invent new projects cannot keep up with my capacity to actually finish them, and that is definitely doubly or triply true when I’m pregnant and my energy level is much lower than normal. Basically, I think I have to finish some of my other projects before really committing to exploring this idea further…but it’s killing me…this would change EVERYTHING…