
Most common Persian words

10 November 2007 13 Comments

As a second-generation Iranian American, who has spent practically no time in Iran, I have found it difficult to learn the Persian language beyond mere kitchen talk. In an effort to improve my vocabulary, I sought out a list of the most common Persian words. I could not find such a list, so I searched for a Persian-language corpus that I could use to produce the list myself.

I came across the Hamshahri Persian Corpus and decided to use it. I ran a word count on the corpus to determine the most common words in the Persian language, and I posted the results here, sorted from most to least frequent.

The list was rather long, so I’ve only included words that appeared in the corpus over 1000 times. I plan to start at the top of the list and make flashcards out of any words I don’t know or am unsure of. This should help me focus on words that are more commonly used. I hope you find it useful as well. I will post the Java code I used to parse the corpus if anybody is interested.
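
For anyone curious what such a word count involves, here is a minimal sketch in Java. It is an illustration under assumptions, not the code that produced the list: the class name WordFrequency, the tokenizing rule (split on anything that is not a Unicode letter or combining mark), and the transliterated sample text are all invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordFrequency {

    /** Counts how often each token occurs (assumed tokenizer: runs of letters and marks). */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("[^\\p{L}\\p{M}]+")) {
            if (!token.isEmpty()) { // split() can yield a leading empty string
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Keeps words seen more than minCount times, most frequent first (ties alphabetical). */
    static List<Map.Entry<String, Integer>> mostCommon(Map<String, Integer> counts, int minCount) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > minCount) {
                entries.add(e);
            }
        }
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        // Transliterated stand-in for corpus text; a real run would read the corpus files.
        Map<String, Integer> counts = count("va dar va be dar va");
        for (Map.Entry<String, Integer> e : mostCommon(counts, 1)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

For the published list the threshold would be 1000 rather than 1. Note that this splitting rule also breaks words at the zero-width non-joiner that Persian uses inside some words; whether that is desirable depends on what the list is for.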

If I ever find the time, my next goal is to try to find phrases, word combinations, and word patterns. If anybody is interested in helping out, please let me know. I’d also be interested in finding out about similar (non-commercial) efforts for other languages, particularly other Indo-European languages or other languages that use an Arabic script.
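
A simple starting point for word combinations is to count adjacent word pairs (bigrams). The sketch below assumes the text has already been split into tokens; the class name BigramCounter and the transliterated sample tokens are placeholders for illustration only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BigramCounter {

    /** Counts each pair of neighbouring tokens, keyed as "first second". */
    static Map<String, Integer> countBigrams(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String pair = tokens.get(i) + " " + tokens.get(i + 1);
            counts.merge(pair, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder tokens; a real run would feed in the tokenized corpus.
        List<String> tokens = Arrays.asList("dar", "in", "sal", "dar", "in");
        System.out.println(countBigrams(tokens));
    }
}
```

This version pairs words across sentence boundaries too, so a more careful pass would probably reset at punctuation.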

13 Comments »

  • Shahram said:

    I came across your blog from one of your articles on DevX.com.
    I work as a Java/J2EE developer in Canada.
    It was exciting to find an Iranian with Java expertise and many articles and a book.
    I just want to say hello and wish you the best of luck.

  • Ansa said:

    Hey! Thanks for the list of common Parsi words. Exactly what I was looking for to help me learn!

  • Martin Roberts said:

    Dear Javid,

    I was most impressed to find your list while searching for a list of Persian words by frequency.

    I am starting a course to teach both children and adults and will be using your list as a basis for teaching and learning Persian.
    (I have successfully used similar lists to teach English.)

    One of the things that I have done with your list is use it as the basis for an automated vocabulary assessment.

    This program intelligently selects typically 20 words and, based on which of these words the student knows, estimates the size of the student’s vocabulary.
    (There is some hefty maths involved in determining how to select the words most efficiently.)

    This program can then be run periodically to determine how well the student is learning…
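
    As a rough illustration of the idea only, and much simpler than the adaptive selection described above, one could test one word from each of 20 equal bands of the ranked list and scale the share of known words up to the list size. The class name VocabularyEstimate, the mid-band sampling rule and the sample answers are assumptions made for this sketch.

    ```java
    class VocabularyEstimate {

        /** Returns 1-based ranks at the middle of each of `samples` equal bands of the list. */
        static int[] sampleRanks(int listSize, int samples) {
            int[] ranks = new int[samples];
            double band = (double) listSize / samples;
            for (int i = 0; i < samples; i++) {
                ranks[i] = (int) (i * band + band / 2) + 1;
            }
            return ranks;
        }

        /** Scales the fraction of sampled words the student knew up to the whole list. */
        static int estimate(int listSize, boolean[] known) {
            int knownCount = 0;
            for (boolean k : known) {
                if (k) {
                    knownCount++;
                }
            }
            return (int) Math.round((double) listSize * knownCount / known.length);
        }

        public static void main(String[] args) {
            int[] ranks = sampleRanks(5000, 20);
            boolean[] known = new boolean[ranks.length];
            for (int i = 0; i < 12; i++) {
                known[i] = true; // hypothetical answers: the 12 most frequent samples were known
            }
            System.out.println("Estimated vocabulary: " + estimate(5000, known));
        }
    }
    ```

    With only 20 fixed samples the estimate is coarse (steps of 250 words on a 5000-word list), which is presumably where smarter word selection earns its keep.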

    For beginners, a list of 5000 words is more than sufficient for this purpose. For more skilled (but not fluent) students, however, the list needs to be significantly larger to cater for the rarer words.

    My technical skills are more in the line of maths and teaching, with only a modest set of computer programming skills, but it seems that you are far more proficient at databasing than I am. Would it be possible to re-run your program (if you still have it) to produce, say, the top 20,000 words?

    Kindest regards,

    Martin Roberts
    Tasmania, Australia.

  • Sarah Mottaghinejad said:

    I have been trying to do the exact same thing! I found the Hamshahri corpus, but they must have made some changes to it, because when I tried to use it, it gave me error after error. Thank you for this list! I would love to do a lexicon for these words with pronunciation and typical collocations and phrases. Maybe you could help me figure out how to use the corpus and we could work together?

  • JB said:

    Hello Javid. Thanks for publishing this list. I’ve made a set of flashcards with definitions from the Aryanpour Persian-English dictionary and I’ve been reviewing them using flashcard software. It has become an important part of my Persian studies.

  • Honso said:

    Hi,
    I’m using this word list in my Android app (MultiLing Keyboard). It will be available in the next release.

    I’m also interested in the Java code for the parser.
    Thanks!

  • Landon said:

    Thanks a lot! I am American. I have travelled the world as a young man. I speak 3 languages fluently right now. I am working on Chinese and Persian. Thanks a lot. It is very hard to find resources online to learn Persian. Thanks again.

  • Shepard25Melba said:

    The home loans suppose to be useful for guys, which would like to organize their career. As a fact, it is not really hard to get a small business loan.

  • Tony said:

    Your Farsi word frequency list would be much improved as a learning tool if one could click on each word and get its English meaning, and/or if there were also a list with the English meaning next to the Farsi word.

  • romillyh said:

    Hmm, interesting, but ultimately frustrating and impractical for the purposes suggested here. So when as a result of a continuing search for categorised word lists I found Colin Turner’s “Thematic Dictionary of Modern Persian” I ordered it on the spot. It’s not perfect – for example, in the first section on air travel the word for seat does not appear (good lord!), although “reserved seat” does. Nor do we find baggage rack, window (panjereh – as in house?) seat, aisle seat, economy/business class, refreshments, meal, check-in time (tho check-in desk is there), departure time, and I’m sure a host of other things. The word given for airplane is tayaareh when I always knew it as havaapeyma. And we get words for parachute and sound barrier, which are somewhat redundant for an air traveller! But there are 84 terms followed by 48 variably useful phrases, some of which should really be in the main list of terms. In hardback the book is horribly expensive at full price, but on Amazon UK I found a new paperback copy at £30 ($46.50).

    So, what’s the problem with the Hamshahri corpus as a vocab learning tool? Yesterday, after finding Javid’s post, I spent some time going through the first 200 entries (as I see some others have done too) and quickly hit the drawbacks – which are fairly obvious on a little reflection. Here they are:

    No. 1, Subject matter: The database for this very large corpus is entirely from a single newspaper, and therefore represents the vocab you’d expect from the various topics (“politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, sports, etc.” – Wiki) covered by a mid-range urban newspaper (“a major national Iranian Persian-language newspaper published by the Municipality of Tehran” – Wiki). So, it is definitely not everyday vocab, and not the sort of stuff you’d want to get under your belt in the beginner to early intermediate stages of vocab acquisition. It might help you follow a news broadcast, but you’d find little that would enable you to follow a film or chat to friends.

    No. 2, Atomisation: The second major drawback you immediately hit in working with the Hamshahri frequency list is the way words are broken up into “particulates” like raa (“word” no. 8 in the list), -and (3rd pers plural verb ending), -haa and -haye, and even mi-. This may be ok for single word forms like nouns, but it just doesn’t work for verbs, which of course appear separately in all their conjugated forms – for example shode (no. 13), shod (no. 16) and mishavad (no. 22). This means that the frequency of verbs (here, shodan) is completely misrepresented. Equally problematic is that the host of important compound verbs simply don’t appear at all! (A useful list is here: http://persian.nmelrc.org/pvc/compounds.php). Darrudi, Hejazi and Oroumchian, in “Assessment of a modern farsi corpus”, discuss some of these problems in using the Hamshahri corpus as a basis for analysing word frequency in farsi (search on “hamshahri corpus zipf”, where “zipf” is the analytical method). Plus of course there are loads of words like va (no. 1), dar (no. 2), and be (no. 3) that you hardly need to be bothered with.
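
    One way to soften the verb problem is to fold conjugated forms into a single entry before ranking. The sketch below assumes a hand-built table from surface form to infinitive; the class name LemmaMerge, the table and the counts are invented for illustration and are not figures from the list.

    ```java
    import java.util.HashMap;
    import java.util.Map;

    class LemmaMerge {

        /** Sums counts of all forms that map to the same lemma; unmapped forms stay as they are. */
        static Map<String, Integer> merge(Map<String, Integer> formCounts,
                                          Map<String, String> formToLemma) {
            Map<String, Integer> merged = new HashMap<>();
            for (Map.Entry<String, Integer> e : formCounts.entrySet()) {
                String lemma = formToLemma.getOrDefault(e.getKey(), e.getKey());
                merged.merge(lemma, e.getValue(), Integer::sum);
            }
            return merged;
        }

        public static void main(String[] args) {
            // Hypothetical counts, chosen only to show the arithmetic.
            Map<String, Integer> formCounts = new HashMap<>();
            formCounts.put("shode", 5);
            formCounts.put("shod", 3);
            formCounts.put("mishavad", 2);
            formCounts.put("va", 10);

            // Hand-built table; a real one would need a dictionary or a morphological analyser.
            Map<String, String> formToLemma = new HashMap<>();
            formToLemma.put("shode", "shodan");
            formToLemma.put("shod", "shodan");
            formToLemma.put("mishavad", "shodan");

            System.out.println(merge(formCounts, formToLemma));
        }
    }
    ```

    Building the table is of course the hard part, and compound verbs would need a separate multi-word pass.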

    No. 3, Extraction is a ton of work: And in terms of what you get out of it, too much work! That’s why I ended up ordering Turner’s book, whatever its own drawbacks. With Hamshahri you have to sit with a dictionary translating and sometimes transliterating the words you don’t know. I use Haim’s Persian-English (1975 edn) dictionary. 200 words took up much of the day! I’d guess you might manage 500 words per day, so 3,000 words would take a week. Of those 3,000 “words” only 2,000-2,500 might be any use because of all the redundancy. And many certainly would not be words you’d find yourself needing in everyday life in Iran, that’s for sure!

    So, all in all the tempting idea that one can “cheat” or fast-track one’s way into acquiring prime farsi vocab by learning the first 3,000 words in a frequency list like Hamshahri is a (complete) chimera. It has the further disadvantage that one is learning vocab out of any context. This is the attraction of a list organised by topic, as in Turner’s “dictionary”. Turner explains very well in his Introduction to the book why it is so much better to have a dictionary organised thematically rather than alphabetically. On Amazon you can find the book and read the preliminary section, along with the air travel section and the complete and very extensive Index. My only further gripe with the book is that in the main sections the words are listed in persian alphabetical order, which is quite ludicrous considering that the users are going to be english-speakers seeking words in their own language! Incidentally the persian is given in both farsi script and a good, no nonsense transliteration, if I am slightly surprised that qaaf and ghein are both transliterated as “q”, with no explanation from the author. I used to live in Baghe Saba, and to my ear Bagh is definitely different from, say, Qom. But never mind, with some 550 pages of words (including 12 pages on slang) and a 127-page index the dictionary has to be pretty useful. More precisely, there are claimed to be 25,000 entries organised under 70+ topics – which by the way include the “heavy” stuff like economics, politics and development that feature large in the Hamshahri corpus. Now that has to be a handy short cut!

    About me
    Goodness, I’ve been hearing farsi spoken on and off since 1968, but like Javid have never really got to grips with it. My farsi is “shopping” farsi acquired during two years teaching and doing other work in Tehran in the early 1970s. In about 1977 I did a completely ridiculous year in the Persian department at Edinburgh Uni where one of the set books was Ali Dashti’s Ayaame Mahbas, probably because the author was a friend of my professor (Lawrence Elwell-Sutton). This flowery writer was a guy who most definitely could not call a spade a spade! Consequently I learned nothing that I can remember now aside from learning to write quite well. Now that we have YouTube I watch a lot of persian stuff and have become immensely frustrated that I can’t understand much even though so much of it is so, so familiar. I’m one of those crazy westerners who find everything – er, almost everything – persian and iranian totally compelling, like Iran and its truly effervescing culture are my second home and somewhere where I feel I’ve got to be from time to time. E.g. Gugush songs, which I’ve had for decades, just tear me up. Oh dear!

  • romillyh said:

    Should just add, in case I have confused people: Javid extracted the ranked frequency list provided in his post from the Hamshahri corpus (thank you Javid!). The Hamshahri Persian Corpus itself is a database made by the DBRG (Database Research Group) Lab at the University of Tehran of more than 300,000 articles from the Hamshahri newspaper starting April or June 1996. As to the end date, version 2 runs from 1996 to mid-2007 (http://ece.ut.ac.ir/dbrg/hamshahri/). My impression remains that to extract useful material from this database you’d have to be fluent in farsi in the first place.

    Re my other link to compound verbs, the home page of the enormously useful University of Texas site is http://persian.nmelrc.org/index.html. This appears to be a site providing materials for a course in farsi that has now been superseded, but gives some great material nonetheless. In the Audio section for example try “My Cat” (near the bottom). Click on the “Text” link (http://persian.nmelrc.org/audio/myCat1.html) to see just how good this is. The audio links at the side don’t take you off the page, so you can listen while reading the farsi (and transliteration if you need it). The whole unbroken audio chunk can be listened to from the first link in the “My Cat” section.

  • Milad said:

    Hey there,
    I’m Iranian and I’m living in Iran right now. I bumped into your blog accidentally and will be happy if I can help you somehow.

    Regards

  • Joseph Marzbani said:

    I appreciate your attempt. I’m a Java developer and I’ve always felt a need for such a corpus. Thank you …