{"id":988,"date":"2014-10-14T21:06:04","date_gmt":"2014-10-14T20:06:04","guid":{"rendered":"http:\/\/blog.blockos.org\/?p=988"},"modified":"2014-10-14T21:06:04","modified_gmt":"2014-10-14T20:06:04","slug":"inside-croissant-mountain-evil-takes-its-form","status":"publish","type":"post","link":"https:\/\/blog.blockos.org\/?p=988","title":{"rendered":"Inside Croissant Mountain, Evil Takes Its Form."},"content":{"rendered":"<p>Let&#8217;s say you are going to work on some meaningless (not to say useless) piece of string processing code. One good way to test it is to apply it to a huge set of strings. You can take some huge random text. You may also want to use a lexicon for problems like palindromes, isograms, blanagrams or any funky word list generation.<br \/>\nLet&#8217;s pick a language at random&#8230; French!<br \/>\nYour favorite search engine will probably return the word list generated from the <em>Francais-Gutenberg<\/em> lexicon by <em>Christophe Pallier<\/em> <a href=http:\/\/www.pallier.org\/ressources\/dicofr\/dicofr.html>(link)<\/a>. It&#8217;s a 3.6MB file containing 336531 French words.<\/p>\n<p>Another alternative is to use the lexicon from the <em>&#8216;Association des Bibliophiles Universels&#8217;<\/em> <a href=http:\/\/abu.cnam.fr\/DICO\/donner-dico-uncompress.html>(link)<\/a>. Unfortunately, there&#8217;s one file per letter, so you need to remove the licence and footer from each file before concatenating them.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n#!\/bin\/bash\r\n# Download the 26 per-letter lists, strip the licence and footer,\r\n# and keep only the first column (the word itself).\r\nmkdir -p src\r\nfor i in {a..z}\r\ndo\r\n\twget \"http:\/\/abu.cnam.fr\/cgi-bin\/donner-dico-uncompress?liste_$i\" -O \"src\/list_$i\"\r\n\ttail -n +60 \"src\/list_$i\" | head -n -2 | cut -f1  &gt;&gt; words.DICO.txt\r\ndone\r\n<\/pre>\n<p>The generated file contains 289576 words.<\/p>\n<p>Finally, you can use the <em>Morphalou<\/em> lexicon from the <em>Centre National de Ressources Textuelles et Lexicales<\/em> <a href=http:\/\/www.cnrtl.fr\/lexiques\/morphalou\/>(link)<\/a>. 
Unfortunately, the lexicon is a 155MB XML file containing the list of inflected forms for each entry. I will not describe the XML structure here. Long story short, we will extract the content of every <em>orthography<\/em> tag and remove duplicates.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\ngrep -in orthography Morphalou-2.0.xml | perl -pe 's\/.*&lt;orthography.*&gt;(.*)&lt;\\\/orthography&gt;\/$1\/' | awk ' !x[$0]++' &gt; words.Morphalou.txt\r\n<\/pre>\n<p>This wonderful one-liner generates a file containing 410940 words.<br \/>\nThe <strong>awk &#8216; !x[$0]++&#8217;<\/strong> is blatantly ripped from this <a href=http:\/\/stackoverflow.com\/questions\/11532157\/unix-removing-duplicate-lines-without-sorting>Stack Overflow answer<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let&#8217;s say you are going to work on some meaningless (not to say useless) piece of string processing code. One good way to test it is to apply it to a huge set of strings. You can take some huge random text. 
You may also want to use a lexicon\u2026 <a class=\"continue-reading-link\" href=\"https:\/\/blog.blockos.org\/?p=988\">Continue reading<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[27,25,12],"_links":{"self":[{"href":"https:\/\/blog.blockos.org\/index.php?rest_route=\/wp\/v2\/posts\/988"}],"collection":[{"href":"https:\/\/blog.blockos.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.blockos.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.blockos.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.blockos.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=988"}],"version-history":[{"count":6,"href":"https:\/\/blog.blockos.org\/index.php?rest_route=\/wp\/v2\/posts\/988\/revisions"}],"predecessor-version":[{"id":994,"href":"https:\/\/blog.blockos.org\/index.php?rest_route=\/wp\/v2\/posts\/988\/revisions\/994"}],"wp:attachment":[{"href":"https:\/\/blog.blockos.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=988"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.blockos.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=988"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.blockos.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=988"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}