Inside Croissant Mountain, Evil Takes Its Form.

Let’s say you are going to work on some meaningless (not to say useless) piece of string processing code. A good way to test it is to run it over a huge set of strings. You can take some huge random text, but for problems like palindromes, isograms, blanagrams or any funky word-list generation, what you really want is a lexicon.
Let’s pick a language at random… French!
Your favorite search engine will probably return the word list generated from the Francais-Gutenberg lexicon by Christophe Pallier (link). It’s a 3.6MB file containing 336531 French words.
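
For example, hunting palindromes in the downloaded list could look like the sketch below. The filename words.frgut.txt is a placeholder for whatever you saved the list as, and rev may mangle accented words depending on your locale, so treat it as a rough filter rather than a definitive implementation.

# Pair each word with its reversal, keep words of 4+ letters that match their own reversal.
paste -d' ' words.frgut.txt <(rev words.frgut.txt) | awk 'length($1) >= 4 && $1 == $2 {print $1}'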

Another alternative is to use the lexicon from the ‘Association des Bibliophiles Universels’ (link). Unfortunately, there is one file per letter, so you need to strip the license header and footer from each one before concatenating them.

#!/bin/bash
# Fetch the ABU word lists (one file per letter), drop the 59-line
# license header and the 2-line footer, keep only the word column,
# and append everything to a single file.
mkdir -p src
for i in {a..z}
do
    wget "http://abu.cnam.fr/cgi-bin/donner-dico-uncompress?liste_$i" -O "src/list_$i"
    tail -n +60 "src/list_$i" | head -n -2 | cut -f1 >> words.DICO.txt
done

The generated file contains 289576 words.

Finally, you can use the Morphalou lexicon from the Centre National de Ressources Textuelles et Lexicales (link). Unfortunately, the lexicon is a 155MB XML file containing the list of inflected forms for each entry. I will not describe the XML structure here. Long story short, we will extract all the orthography tags and remove duplicates.

grep -in orthography Morphalou-2.0.xml | perl -pe 's/.*<orthography.*>(.*)<\/orthography>/$1/' | awk ' !x[$0]++' > words.Morphalou.txt

This wonderful one-liner generates a file containing 410940 words.
The awk ‘ !x[$0]++’ trick is blatantly ripped from this Stack Overflow answer.
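
If the idiom looks cryptic: x[$0]++ evaluates to zero the first time a given line is seen and to something non-zero afterwards, so !x[$0]++ is true only on a line’s first occurrence, which deduplicates while preserving order. A made-up example:

printf 'maison\nchat\nmaison\nchien\nchat\n' | awk '!x[$0]++'

This prints maison, chat and chien, each once and in the order they first appear.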