Let’s say you are going to work on some meaningless (not to say useless) piece of string processing code. One good way to test it is to run it against a huge set of strings. You can take some huge random text. You may also want to use a lexicon for problems like palindromes, isograms, blanagrams, or any other funky word-list generation.
Let’s pick a language at random… French!
Your favorite search engine will probably return the word list generated from the Français-Gutenberg lexicon by Christophe Pallier (link). It’s a 3.6MB file containing 336531 French words.
Another option is the lexicon from the ‘Association des Bibliophiles Universels’ (link). Unfortunately, there’s one file per letter, so you need to remove the license header and the footer from each file before concatenating them.
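#!/bin/bash
# Download the ABU word list letter by letter and keep only the words.
mkdir -p src
for i in {a..z}
do
    # Quote the URL so the shell does not treat '?' as a glob character.
    wget "http://abu.cnam.fr/cgi-bin/donner-dico-uncompress?liste_$i" -O "src/list_$i"
    # Skip the first 59 lines (license), drop the last 2 (footer), keep the first column.
    tail -n +60 "src/list_$i" | head -n -2 | cut -f1 >> words.DICO.txt
done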
The generated file contains 289576 words.
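To make the trimming step less magical, here is a minimal sketch of the same tail / head / cut pipeline applied to a tiny made-up file. The function name trim_list, its arguments, and the sample content are all hypothetical; the real ABU files have a much longer header. It also assumes GNU head, since a negative line count is not supported everywhere.

```shell
#!/bin/bash
# trim_list FILE HEADER_LINES FOOTER_LINES (hypothetical helper)
# Skip the header, drop the footer, keep the first tab-separated column.
trim_list() {
    tail -n +"$(( $2 + 1 ))" "$1" | head -n -"$3" | cut -f1
}

# Made-up sample: 2 header lines, 3 entries, 1 footer line.
sample="$(mktemp)"
printf 'HEADER 1\nHEADER 2\nabaissa\tabaisser\tVer\nabats\tabats\tNom\nzut\tzut\tInt\nFOOTER\n' > "$sample"

trim_list "$sample" 2 1
rm -f "$sample"
```

With the real files, the equivalent call would be trim_list src/list_a 59 2, which matches the tail -n +60 and head -n -2 used in the script.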
Finally, you can use the Morphalou lexicon from the Centre National de Ressources Textuelles et Lexicales (link). Unfortunately, the lexicon is a 155MB XML file containing the list of inflected forms for each entry. I will not describe the XML structure here. Long story short, we will extract the content of all the orthography tags and remove duplicates.
grep -in orthography Morphalou-2.0.xml | perl -pe 's/.*<orthography.*>(.*)<\/orthography>/$1/' | awk ' !x[$0]++' > words.Morphalou.txt
This wonderful one-liner generates a file containing 410940 words.
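If you want to try the extraction idea without downloading 155MB, here is a rough sketch on a tiny invented sample. It is a variant, not the one-liner above: it uses grep -o and sed instead of perl, so it also copes with several tags on one line. The function name extract_orthography and the sample XML (everything except the orthography tag itself) are assumptions made for illustration.

```shell
#!/bin/bash
# extract_orthography FILE (hypothetical helper)
# Print the text of every <orthography> element, one per line,
# keeping only the first occurrence of each word.
extract_orthography() {
    grep -o '<orthography[^>]*>[^<]*</orthography>' "$1" \
        | sed -e 's/<[^>]*>//g' \
        | awk '!x[$0]++'
}

# Invented sample; the real Morphalou structure is richer than this.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<entry>
  <lemma><orthography>chat</orthography></lemma>
  <form><orthography>chat</orthography></form>
  <form><orthography>chats</orthography></form>
</entry>
EOF

extract_orthography "$sample"
rm -f "$sample"
```

On this sample the function prints chat and chats, each once. The word count on the real file may differ slightly from the one-liner, since the two approaches do not treat odd lines the same way.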
The awk ' !x[$0]++' idiom is blatantly ripped from this Stack Overflow answer.
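For the curious, here is a small sketch of why that idiom works. The array x counts how many times each line has been seen. x[$0]++ returns the old count, which is 0 (false) the first time, so the negation is true only for the first occurrence and awk prints the line. The wrapper name dedupe is made up for this example.

```shell
#!/bin/bash
# dedupe (hypothetical helper): keep the first occurrence of each input line,
# preserving the original order.
dedupe() {
    awk '!x[$0]++'
}

printf 'chat\nchien\nchat\nane\nchien\n' | dedupe
```

This prints chat, chien, ane. Unlike sort -u, it does not reorder the input, at the cost of keeping every distinct line in memory.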