Inside Croissant Mountain, Evil Takes Its Form.

Let’s say you are going to work on some meaningless (not to say useless) piece of string processing code. One good way to test it is to apply it on a huge set of strings. You can take some huge random text. You may also want to use a lexicon for problems like palindrome, isogram, blanagram or any funky word list generation.
Let’s pick a language at random… French!
Your favorite search engine will probably returns the word list genrated from the Francais-Gutenberg lexicon by Christophe Pallier (link). It’s a 3.6MB file containing 336531 french words.

Another alternative is to use the lexicon from the ‘Association des Bibliophiles Universels’ (link). Unfortunately there’s a file per letter. So you need to remove the licence and footer before concatenating them.

#!/bin/bash
for i in {a..z}
do
	wget http://abu.cnam.fr/cgi-bin/donner-dico-uncompress?liste_$i -O src/list_$i
	tail -n +60 src/list_$i | head -n -2 | cut -f1  >> words.DICO.txt
done

The generated file contains 289576 words.

Finally, you can use the Morphalou lexicon from the Centre National de Resources Textuelles et Léxicales (lien). Unfortunately, the lexicon is a 155MB XML file containing the list of inflected forms for each entry. I will not describe the XML structure here. Long story short, we will extract all the ortography tags and remove duplicates.

grep -in orthography Morphalou-2.0.xml | perl -pe 's/.*<orthography.*>(.*)<\/orthography>/$1/' | awk ' !x[$0]++' > words.Morphalou.txt

This wonderful one-liner generates a file containing 410940 words.
The awk ‘ !x[$0]++’ is blatantly ripped from this stackoverflow answer.

Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

What is 2 + 9 ?
Please leave these two fields as-is:
IMPORTANT! To be able to proceed, you need to solve the following simple math (so we know that you are a human) :-)

This site uses Akismet to reduce spam. Learn how your comment data is processed.