15 years of the Known Words Test, now with Vietnamese!

Back in March 2011, I wrote a web application to estimate how many Chinese words a person knows. The test works much like a flashcard set: for each word presented, you mark whether you know the word (using your own judgment), optionally revealing the pinyin and English definition before deciding. A sample of 165 words is drawn from the top 36,000 in the Lancaster Corpus of Mandarin Chinese. At the end, the results are extrapolated across frequency bands to give an estimated vocabulary size. A Chinese character test was added later, with the same methodology, but at the character level rather than the word level. The app has been quietly running for 15 years now.
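To make the extrapolation step concrete, here is a minimal Python sketch of a band-based estimate. It is an illustration only, not the app's actual code: the function name, the band size of 2,000 words, and the 10 samples per band are all assumptions chosen for the example.

```python
import random

def estimate_vocabulary(ranked_words, knows, band_size=2000, per_band=10, seed=0):
    """Estimate vocabulary size from a small sample of each frequency band.

    ranked_words: words ordered from most to least frequent.
    knows: callable returning True if the test taker marks the word as known.
    band_size and per_band are hypothetical values, not the app's settings.
    """
    rng = random.Random(seed)
    estimate = 0.0
    for start in range(0, len(ranked_words), band_size):
        band = ranked_words[start:start + band_size]
        sample = rng.sample(band, min(per_band, len(band)))
        known_fraction = sum(1 for word in sample if knows(word)) / len(sample)
        # Assume the sampled fraction holds for the whole band.
        estimate += known_fraction * len(band)
    return round(estimate)

# Toy example: a made-up list of 36,000 ranked words, and a learner who
# knows exactly the 4,000 most frequent ones.
ranked = [f"w{i}" for i in range(36000)]
learner_knows = lambda word: int(word[1:]) < 4000
print(estimate_vocabulary(ranked, learner_knows))  # 4000
```

With a real learner the known fraction in each band is noisy, so a handful of lucky hits in a rare band can move the total noticeably.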

The app now has a new test – Vietnamese! It’s been something I’ve wanted to add since starting my Vietnamese studies last year, to get an estimate of my progress at various points. The test follows the same format as the Chinese ones – a sample of words drawn from a frequency list, marked known or unknown, with a vocabulary estimate at the end. Before the main test, there is a short pre-test with a few questions to gather some anonymous information about your current level of Vietnamese. This data will be used for future analysis, similar to the original research motivation behind the Chinese test. If you are a Vietnamese learner at any level, give it a try! The tests are all at https://www.zhtoolkit.com/apps/wordtest/.


8 month progress in Vietnamese

Eight months ago I embarked on a new adventure, jumping into the Vietnamese language with almost no prior knowledge of it. With years of experience studying Chinese, I had some ideas on the methods I wanted to use. So how well did they work?

I have put roughly 300 hours into studying so far. At first, I was solely grinding flashcards for about 90 minutes per day, since everything was unfamiliar and difficult. After a few months, as my spaced repetition intervals lengthened and I added few new words, the daily reviews shrank to 30 minutes. In the past 2 months I have started adding new words again and reviewing more aggressively, so the flashcard reviews are now consistently 1 hour per day. Five months ago I started weekly 1-hour tutoring sessions. Throw in a few minutes of short readings each day, and the total over 8 months is roughly 300 hours.

How fluent am I?

So where am I in this learning journey? At this point I am definitely in the advanced beginner category. In terms of the European CEFR levels, I would say I’m in the middle of level A2 – not a beginner anymore, but not yet at the intermediate level. But I am better in some aspects than others. Listening ability has been particularly challenging, and it lags the other areas. So I prefer to think of levels per skill, which I would break down this way.

With a focus on vocabulary, I can read simple texts graded for beginning or intermediate learners. I am probably a few months away from being able to read easier native texts. I can make out the gist, but there are enough unknown words that the meaning isn’t always clear.

Writing and speaking are nearly the same level, since they are both producing the same output. Writing lags because of spelling errors, which is a particular challenge of Vietnamese. Using the Vietnamese TELEX input method, my mobile phone offers word suggestions to help autocomplete words. But typing on a computer requires exact spelling. My pronunciation has gotten better than it was for the first few months when I didn’t have formal instruction and had to make do. But I’m sure it’s still bad. My tutor can understand me most of the time, but my Vietnamese friends have a harder time.

Listening ability is particularly challenging for me. But it’s been that way for every language I have learned. Even when I can understand the individual words, I can’t process them fast enough to feel the meaning of the full sentence. I have managed to find a few websites that help practice this area, and I am improving, but slowly.

As for the number of words I know, I can estimate this a few different ways.

(1) A “mature” Anki card is one with a spaced repetition interval of >= 21 days, which means it’s been successfully remembered multiple times. I have 4000 mature cards as of now. Roughly one half are Vietnamese->English and one half are English->Vietnamese. I should also factor in my average historical retention rate of 85% for all mature cards. Thus, I probably know at least 1700 Vietnamese words well, plus more which I am in the process of learning.
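The arithmetic behind that figure can be written out as a short Python sketch. It treats the 85% figure as the share of mature cards that are still remembered, and assumes each word appears on two cards, one per direction; the variable names are made up for the example.

```python
mature_cards = 4000        # mature Anki cards (interval >= 21 days)
cards_per_word = 2         # one Vietnamese->English and one English->Vietnamese card
remembered_share = 0.85    # assumed share of mature cards still remembered

unique_words = mature_cards // cards_per_word            # 2000 distinct words
well_known_words = round(unique_words * remembered_share)

print(unique_words, well_known_words)  # 2000 1700
```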

(2) The website 17 Minute Languages asks you to mark a frequency-ranked series of words that you know. After submitting the first set, a second more refined set of words helps home in on a more precise estimate. When I take this test periodically, the results vary each time, since the words are random, but they hover around 1900. The site calls that an A2 level of knowledge.

(3) There is a version of my online Chinese word knowledge test that I use offline, with a Vietnamese word list instead of Chinese. This gives wildly different estimates every time I try it. When I tested a few weeks ago, I got 1755-2100 (+/-400). However, 4 months ago I got a result of 2176-3108, when I didn’t even have that many words in my flashcards. Knowing a random uncommon word bumps up the estimate a lot. I need to look into the software calculations, methods, and quality of the word list to hopefully improve the result quality.

While I am far from fluent, I can definitely notice steady progress. Many people know about the “intermediate plateau”, where getting from intermediate level to advanced fluency takes a long time and a lot of effort. But I think there is also a “beginner’s plateau”. At the beginner level you build up your knowledge of grammar and frequent words. But to reach the intermediate level, you need to be able to read and listen to more native content, and that requires a level of language knowledge that is not a smooth jump from a beginner level. So the progress from A1 to A2 can happen quickly and smoothly, but from A2 to B1 takes an extended effort over a longer time. I only started to notice reading become easier a month ago, or 7 months into my study. I had been trying to read a difficult native level text, full of colorful prose with rare words that I had trouble remembering. One day I accidentally turned to the foreword written by the author, instead of the main text. And I could understand it! I came across many words I had studied as flashcards, so I was adding additional mental connections for useful words. If I had more access to reading material graded for advanced beginners I would have more opportunities like this. Finding resources like this has been a challenge in Vietnamese, which doesn’t have the same number of learners as the most popular foreign languages.

What worked?

Curated sentence flashcards: I knew I wanted to start with not only word study, but sentence flashcards in order to quickly get familiar with grammar patterns and word clusters. I found one deck, Vietnamese to English vocab and practice, that contained 500 sentences, a decent collection of grammar examples and beginner level words. As new sentences were introduced daily, it was interesting to see the way my understanding of the grammar grew. Sometimes I had a theory on a grammar point or the meaning of a phrase based on the sentence and translation. But then when I encountered another card later, it might disprove my theory and I would need to modify my understanding.

Curated word lists: Similarly, I used the same downloaded flashcards as my initial list of 1000 words to study. Again, it was a fairly decent set of words for a beginner. It took exactly 6 months before all the cards had been introduced, as I was limiting myself to 100 cards per day and 20 new cards maximum. This tended to slow down the rate of new words as time went on, as more and more of the 100 daily cards were taken up by reviews. I eventually raised the daily limit, which finally led to the last new cards being added.

Tutors: Studying languages alone as a hobby can be fun but still challenging. In Vietnamese, it was particularly hard due to needing to learn pronunciation rules. I had a textbook that was quite thorough with linguistic descriptions and tongue positions for how to pronounce sounds, but with no audio examples I had no idea whether I was understanding the descriptions correctly. YouTube and podcasts had occasional clips that described some pronunciation highlights, but nothing that had the comprehensive set of rules that would help me read anything I would encounter. Additionally, there are some differences between Northern and Southern Vietnamese, which narrowed the available content even further.

SVFF, or Southern Vietnamese For Foreigners, is a Saigon-based company that provides 1-on-1 lessons, video courses, and structured learning materials. I chose the teacher whose introduction sounded the most focused on using technology for teaching, which I saw as a plus. We have weekly 1-hour sessions on Zoom. Often there will be a fun game or quiz to get a conversation started, but the lessons are mostly free-form discussions that expose new words and expressions. Each lesson I speak more and more, and even though there are frequent mistakes, those are opportunities to learn something new. In addition to the class, there is a good set of offline reading material, including detailed pronunciation guides. Not only was it the full set of instructions on how to pronounce all the sounds, but it was attentive to the particulars of the Southern accent, or even Saigon-specific accent.

Graded readers: Along with the grammar books I bought and quickly moved on from, I obtained one book that was a graded beginner level reader, “69 Short Vietnamese Stories for Beginners” by Adrian Gee. Once I learned my first 1000 words from my flashcard set, these stories were a simple 5 minute read (about 150 words each). Each story introduced some new words specific to the topic, with definitions after the story. Besides those, there were around 5-10 additional words in each story I didn’t know, so I was constantly adding extra words to a flashcard set.

Graded audio stories with dictation practice: I wanted some way to improve my listening skill, but random YouTube videos were not cutting it. I needed something slower and simpler that wouldn’t overwhelm me. I found Langi, which has fun news stories with audio from levels A1 to B2. The different skill levels seem to be the same stories, just with longer sentences and harder words. In addition to audio narration of the whole story, there is an exercise to review it sentence by sentence. There is also a dictation exercise, where you need to type each sentence as you hear it, repeating the audio as many times as needed. This is a difficult exercise, but it has been a huge help in improving my spelling. When reading, it’s easy to gloss over a faulty memory on word spelling, but when you have to produce a 100% output, mistakes are an opportunity to rewire brain connections.

I am aware of a different service, Glossika, but I have not yet tried it. This site has individual sentences rather than stories, but it may also be an effective practice, and I plan to try it out soon.

Motivation: Despite Vietnamese being a challenging language to learn, I have been enjoying the process so far. Progress is slow but still noticeable. I do flashcard reviews nearly every day, my longest streak being 78 days in a row. I don’t know yet how far I will go before I’m satisfied with my level, but for now I am excited for the future and for expanding my skills.

What didn’t work?

Not learning pronunciation right away: Due to a lack of good learning resources, I started studying with an incomplete understanding of how to correctly pronounce written words. This meant that even though I was learning words and sentences through flashcards and even with correct spelling, I still wasn’t pronouncing them correctly. With the help of tutoring resources I did finally learn, but I did have to relearn words that I had been saying wrong.

Anki decks of the “top N words”: This is something some language learners claim will get you to fluency quickly. Since 80-85% of a typical text is from the 1000 most frequent words (at least for English, see Nation, P. (2006)), all you need is to memorize the top 1000 words and you can basically understand anything. But note that those same research articles tested how much a learner could comprehend from knowing 80%, and it was basically gibberish. From experimentation, most learners would need 98% coverage to comprehend a text.

I wasn’t completely convinced that a list of the top 1000 words would be the key, but I thought I would give it a shot. I quickly gave it up after a few days as being completely worthless. A large number of those words are grammatical connectors with vague and multiple definitions as isolated words. Trying to memorize them out of context was both nearly impossible and unenjoyable.
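For readers who want to check the coverage figures on their own material, here is a minimal Python sketch of how text coverage is usually measured: the share of running words in a text that belong to a known-word set. The function names and the toy data are assumptions made for the example, and the text is assumed to be already split into words.

```python
from collections import Counter

def top_n(tokens, n):
    """Return the n most frequent words in a tokenized corpus as a set."""
    return {word for word, _ in Counter(tokens).most_common(n)}

def coverage(tokens, known_words):
    """Fraction of running words in the text that are in known_words."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in known_words) / len(tokens)

# Toy corpus: one very frequent word and two rare ones.
tokens = ["a"] * 8 + ["b"] + ["c"]
print(coverage(tokens, top_n(tokens, 1)))  # 0.8
```

At 80% coverage every fifth running word is unknown, which is why knowing the most frequent words alone does not make a text readable.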

Audio flashcards from native content: Despite my listening skill being challenged, I thought if I just practiced listening to native content with determination, I would have to improve. I downloaded audio from podcasts, used the Amazon transcription tools to convert it to a subtitle file (which worked impressively well), then another tool to split the audio and transcript into Anki flashcards. In addition, I pulled new words from the transcript to make definition flashcards. The result was just too difficult to have any benefit.

Immersive input: I put no stock in massive exposure to incomprehensible audio. In fact, I think it’s worse than doing nothing, because it gets you accustomed to tuning out and accepting that you don’t understand anything. But I thought that if I had free time anyway, why not just put on a podcast and maybe I’ll get something out of it. But hours of listening gave no benefit at all, as I was picking out around 1 word every 5 minutes. Yet, if I had the transcript where I could read at my own speed, I might understand the gist. I will need a lot more low-level listening practice to bring my listening up to the level of my other skills.

Flashcards from native texts: Extracting difficult words from a text into flashcards to help with understanding wasn’t a terrible experience, but it wasn’t as effective as I expected. My method was to use a version of my Chinese Word Extractor that worked with Vietnamese (still in progress), filter out words only used once, and pick words that were more frequent in the text than in the language as a whole. I may have picked up a few useful words this way. But most of the words were less helpful for comprehension than their frequency in the text would suggest. Many of the words picked up by this method are just colorful literary synonyms used only in print but not in the spoken language. Another lesson learned from this experiment is to only make flashcards for Vietnamese to English, and not from English to Vietnamese. A method I have found more effective for reading is to just read without any vocabulary preparation, and to only make flashcards for the unknown words that are essential to understanding the meaning of a sentence.
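The selection rule described above (drop words that occur only once, keep words that are relatively more frequent in the text than in the language overall) can be sketched in a few lines of Python. This is not the Word Extractor's actual code: the function name, the `reference_freq` mapping of word to relative frequency in a general corpus, and the ranking by frequency ratio are assumptions for illustration, and the text is assumed to be already segmented into words.

```python
from collections import Counter

def pick_candidates(text_tokens, reference_freq, min_count=2, top=20):
    """Pick words that are over-represented in a text relative to a reference corpus.

    reference_freq maps word -> relative frequency in the language as a whole
    (hypothetical data; words missing from it are treated as frequency 0).
    """
    counts = Counter(text_tokens)
    total = sum(counts.values())
    scored = []
    for word, count in counts.items():
        if count < min_count:
            continue  # filter out words used only once
        text_rel = count / total
        ref_rel = reference_freq.get(word, 0.0)
        if text_rel > ref_rel:
            ratio = text_rel / ref_rel if ref_rel > 0 else float("inf")
            scored.append((ratio, count, word))
    # Most over-represented first, then most frequent, then alphabetical.
    scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
    return [word for _, _, word in scored[:top]]

tokens = ["rồng"] * 3 + ["và"] * 5 + ["hiếm", "là"]
reference = {"và": 0.6, "là": 0.05, "rồng": 0.001}
print(pick_candidates(tokens, reference))  # ['rồng']
```

A ratio-based ranking like this tends to surface exactly the kind of rare literary vocabulary described above, since such words are almost absent from a general frequency list.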

What’s next

For now, my plan is to mostly continue my existing methods. I continue to add 10-20 new flashcards every day, so my knowledge will slowly progress. I am already able to understand easier YouTube content made for learners, so listening practice is something that I can slowly increase. I would really like more offline books for free relaxed reading. The most exciting plan for the future is that I will be going back to Vietnam in 2 months. After 10 months of study, I want to know how my experience will be different when the language is no longer incomprehensible.


Adventures in Vietnamese

For some personal reasons, I have a recent project to learn Vietnamese, not just basic words and phrases, but to become somewhat fluent. With all the years of studying Chinese, I had never gotten past an intermediate level. That level of fluency would be my goal for Vietnamese. It’s enough to read basic books and know about 70% of the text. And it’s enough to have basic conversations and be able to express myself, even if I need to substitute unknown words on occasion.

Jumping into a language completely cold is a great opportunity to take all that I have learned from Chinese, and see what applies more generally to language learning. What tools and strategies can advance my levels quickly, without getting sidetracked on ineffective methods?

First impressions

I have gone to Saigon twice, once a year ago, and again last month. The only studying I had done up to last month was a few Pimsleur lessons. I had learned to say “How are you” and was excited to use it. When I said it to my friend’s mother, it turns out I had used the wrong pronoun considering our age differences, and it was offensive. I also used standard Vietnamese and not the Saigon dialect, so it was the wrong syntax anyway. The result was just confusion on their part. It was discouraging, but it also woke me up to understand that it’s not going to be an easy language to learn.

Last year when I went, I stayed with my brother who lived in Saigon and spoke English. There was no need to know any Vietnamese, and I left the trip not having picked up anything. Then, a month ago, I went to Vietnam with a friend, and we stayed with her extended family. This was a completely different experience, and my lack of language skills became a clear problem. I needed to quickly pick up some useful phrases to get by. Since my friend was a native speaker, all I needed were a few survival phrases. It turns out this is all I needed:

  • bathroom
  • man/woman (to distinguish bathroom signs)
  • what is your name
  • it’s nice to meet you
  • thank you
  • hello
  • good morning/afternoon/evening/night
  • see you later
  • it’s delicious
  • I’m full
  • I’m tired
  • My stomach is bad (yes, you can have too much noodle soup)
  • excuse me/I’m sorry
  • pronouns!

Some other terms that I didn’t use but might have been useful for rudimentary conversation include:

  • yesterday/today/tomorrow
  • my job (IT or “computers”)
  • numbers 1-10
  • months
  • expressing time
  • to go/arrive/return

Unlike English and Chinese, it’s critical to use the correct pronoun, considering the person’s gender along with your relative ages. My impression so far is that using the wrong word to address someone is a serious offense that is going to cause some bad feelings. There are neutral pronouns to play it safe, and every formal lesson I have encountered so far (Pimsleur, textbooks, and flashcards) uses these neutral pronouns. If you relied solely on these, you would be missing a lot of important practical usage.

Similar to Chinese, Vietnamese has tones. There are a few more than in standard Mandarin, but at least I was prepared for it and knew to include it in my word learning. But Vietnamese has many vowels and vowel combinations, and those also have diacritical marks. I hadn’t learned to read these, and vowel diacritics and tone markers all looked the same to me. But Â is a flat tone vowel, À is a tone, and Ẫ is a vowel with a tone.

A lot of Vietnamese is pronounced very differently from how it’s spelled. Chinese (pinyin or other English representation) did have that too, like xin and qin and si. Vietnamese has more letters, and more variations. If you have ever ordered phở (beef broth soup), you already know that it is not pronounced like English “foe”. Many sounds are unfamiliar to English speakers, and it is taking some training to move my mouth in different ways. Correct speaking is essential, as it’s easy to be misunderstood with just a slight difference in tone or pronunciation accuracy. I will prove this out someday, but my hunch so far is that there are more one-syllable words in Vietnamese than in Chinese. This means that there is a higher ambiguity with an incorrect word since there is less context to compensate for it. Apparently, “I want to buy a house” sounds similar to “I want to sit now” at the poor speaking level I am at so far! As a bonus, I need to focus on Saigon dialect, while most learning material is the Hanoi dialect.

The same goes for spelling, as the correct diacritics and tone marks are necessary for being understood. The built-in input method in both Windows and Android is TELEX, which allows adding marks by inputting extra letters after the vowels. It isn’t too hard to pick up. Writing out words has helped me remember words better, since I need to pay attention to the exact spelling, versus casual reading in which it’s easy to gloss over the vowel marks.

Vietnamese uses the Latin alphabet (plus all the diacritics) instead of characters. But it is similar to Chinese in that words are all separated per syllable. Thus, some mental processing is required to parse out the actual words from these syllables. In my first attempt at reading a full text, I really had no idea what I was looking at. Without being able to pick out key words, reading any text isn’t something I can tackle at this point.
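One simple way to automate that parsing step is greedy longest-match segmentation against a word list. The Python sketch below is only an illustration of the idea and is not necessarily how any particular tool does it: the function name, the tiny lexicon, and the 4-syllable cap are assumptions for the example.

```python
def segment(syllables, lexicon, max_len=4):
    """Group space-separated syllables into words by greedy longest match.

    lexicon: set of known multi-syllable words, with syllables joined by spaces
    (hypothetical data). Syllables that match nothing become one-syllable words.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest span first, falling back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words

lexicon = {"sinh viên"}  # "student"
print(segment("tôi là sinh viên".split(), lexicon))
# ['tôi', 'là', 'sinh viên']
```

Greedy matching can pick the wrong split when a shorter word plus the following syllable also happens to form a dictionary entry, so a real segmenter needs more than this.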

With all that in mind, how should I study?

Proposed strategies

I don’t know an easy hack for learning Chinese, or any language at higher levels. What worked for me was a lot of exposure, and spaced repetition in both reading and writing. I drew from available dictionaries, corpora, and word lists available for download, along with my own Chinese Word Extractor program to create word lists and flashcards from texts. Once I knew enough words, reading texts was possible, and that allowed for even more exposure. But my biggest accomplishment was being able to read Harry Potter in Chinese. In fact, I still haven’t read it in English.

Attempting the same for Vietnamese, I see how spoiled I was with Chinese. As of now, there are 807 shared Anki flashcard decks for Chinese. In Vietnamese, there are 123, and many of those are for learning English from Vietnamese, or Chinese from Vietnamese. Chinese has Pleco software which is a great tool for dictionary lookups and etymology. Chinese has official word lists geared for the various levels of the HSK exams.

To get up to speed, I want to set up my pipeline of flashcards to rapidly achieve a base knowledge of core words. Flashcard lists with Vietnamese audio would be especially useful.

The AnkiWeb flashcards I started with are these:

  • Xefjord’s Complete Vietnamese (Southern): 201 words and phrases with audio. I went into Vietnamese completely cold, so having audio cards was helpful to start with. Skip the “Core Vietnamese Vocabulary” which is just 16 phrases with some weird ones like “Hello America.”
  • Vietnamese to English vocab and practice: 500 sentences and 1069 words. No audio, but the example sentences are pretty reasonable for a beginner level, helping with both common words and grammar

I created a fork of my Chinese Word Extractor program that works with Vietnamese. It is still in progress, but modifying the logic to handle Vietnamese instead of Chinese was only a few hours of effort. The greater work was updating the program for Python 3, and switching the GUI framework from WX to tkinter. For a free Vietnamese to English dictionary, I used VNEDICT from Paul Denisowski, who was actually the original creator of CEDICT.

Armed with a vocabulary extractor, I can now create my own word lists. So far, I have taken a list of parsed words from OpenSubtitles, combined it with VNEDICT English definitions, and made a flashcard set from the top 1000 most frequent words.
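As a rough sketch of that pipeline, the Python below joins a frequency list with a dictionary and writes tab-separated rows, a plain-text format that Anki can import. The function names and the in-memory data are made up for the example; the real OpenSubtitles and VNEDICT files would need their own parsing, which is not shown here.

```python
import csv
import io

def build_flashcards(freq_list, dictionary, limit=1000):
    """Return (word, definition) pairs for the most frequent words with a definition.

    freq_list: iterable of (word, count) pairs (hypothetical parsed corpus data).
    dictionary: mapping of word -> English definition (hypothetical parsed dictionary).
    """
    rows = []
    for word, _count in sorted(freq_list, key=lambda pair: -pair[1]):
        if len(rows) >= limit:
            break
        definition = dictionary.get(word)
        if definition is None:
            continue  # skip words the dictionary does not cover
        rows.append((word, definition))
    return rows

def to_tsv(rows):
    """Serialize flashcard rows as tab-separated text, one card per line."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()

freq = [("không", 50), ("xyz", 40), ("tôi", 90)]
vn_en = {"tôi": "I, me", "không": "no, not"}
print(to_tsv(build_flashcards(freq, vn_en, limit=2)), end="")
```

Skipping words without a definition keeps parsing noise out of the deck, at the cost of silently dropping real words the dictionary lacks.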

Eventually, I will make a Vietnamese version of my Word Test online program to track my word knowledge.

I will spend about an hour every day studying. Keeping in mind my rough starting date of March 3, 2025, I want to keep more notes on the timeline for major milestones. While the results are entirely personal, it will give me a rough idea of how long it would take to learn any language with little to no prior knowledge.
