Content from 2017-06
The original post was at https://veer66.wordpress.com/2017/01/19/benchmark-thai-word-tokenizers/, posted on 2017/01/19.
I wonder about the speed of programs written in different languages. For example, I wonder whether a program written in Kotlin and run on the JVM is slower than one written in Go. Although there are several existing benchmarks, this one may still be important, at least for me, because Thai word tokenization is my real task.
So @iporsut and I wrote programs in several programming languages and tried to optimize them.
I conducted the experiment on my laptop, which has an Intel® Core™ i3-4030U CPU @ 1.90GHz × 4, using a 20MB Thai text corpus.
- Rust #1: 3.366 #2: 3.247 #3: 3.241 Avg: 3.284
- Go #1: 5.415 #2: 5.405 #3: 5.416 Avg: 5.412
- Crystal #1: 5.637 #2: 5.679 #3: 5.649 Avg: 5.655
- Kotlin+Clojure #1: 6.547 #2: 6.743 #3: 6.628 Avg: 6.639
- Julia #1: 38.316 #2: 38.112 #3: 38.237 Avg: 38.221
- Python #1: 50.624 #2: 50.803 #3: 50.869 Avg: 50.765
- Clojure+Kotlin #1: 63.502 #2: 67.561 #3: 67.303 Avg: 66.122
- Env: Rust Nightly 2017-01-08, Wordcut source code
- Env: Go 1.7.4, Wordcut source code
- Env: Crystal 0.20.5, Wordcut source code
- Env: Kotlin 1.0.6 + Clojure 1.8.0 + OpenJDK 1.8, Wordcut source code, Wordcut source code
- Env: Julia 0.5.0, Wordcut source code
- Env: Node.js v6.5.0, Wordcut source code
- Env: Python 3.5.2, Wordcut source code
- Env: Clojure 1.8.0 + Kotlin 1.0.6 + OpenJDK 1.8, Wordcut source code
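The averages in the results list can be re-derived from the three runs, and the gap between two languages is easier to read as a ratio. The sketch below is only an illustration: the helper names `avg3` and `ratio` are made up for this example, and the numbers are copied from the Go and Rust rows. The published averages appear to be truncated rather than rounded, so the last digit of some rows (Rust and Julia) can differ from what `printf` prints.

```shell
#!/bin/sh
# Hypothetical helper: mean of three timings, printed with three decimals.
avg3() {
    awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN { printf "%.3f\n", (a + b + c) / 3 }'
}

# Hypothetical helper: how many times larger one average is than another.
ratio() {
    awk -v x="$1" -v base="$2" 'BEGIN { printf "%.2f\n", x / base }'
}

avg3 5.415 5.405 5.416   # Go runs    -> 5.412
ratio 5.412 3.284        # Go vs Rust -> 1.65
```

Read this way, the Go version took roughly 1.65 times as long as the Rust version on this corpus.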
@iporsut has already written multicore versions, so maybe next month I will conduct another experiment.
I happened to wonder: if I die, who will pay for my hosting? So today I migrated my main website and blog posts from PicoCMS hosted at Scaleway to Jekyll hosted at GitHub. When I can no longer pay for the domain name, my site can still be accessed at veer66.github.io
Both PicoCMS and Jekyll are based on Markdown, so I just wrote the shell script below to rename my blog post files and modify some metadata:
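# Rename each post to <date>-<title>.md, using the Title and Date header lines.
for x in *.md
do
  T=$(head -n4 "$x" | grep '^Title:' | sed 's/Title: //' | sed 's/[ "\|?\/\(\)]/-/g')
  D=$(head -n4 "$x" | grep '^Date' | sed 's/Date: //' | sed 's/\//-/g')
  mv "$x" "$D-$T.md"
done
# Strip the /* and */ header markers and turn the Title line into a heading.
for x in *.md
do
  sed -e 's/\/\*//' -e 's/\*\///' -e 's/Title: /# /' "$x" > t && mv t "$x"
done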
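To see what the script does to a single post, here is a small sketch that applies the same `sed` expressions to sample values instead of real files. The function names `jekyll_name` and `convert_header` are hypothetical, and the sample title and date are only assumed to resemble the header of one of the posts.

```shell
#!/bin/sh
# Hypothetical helper: build the new file name from a title and a
# date written as YYYY/MM/DD, using the same sed expressions as the script.
jekyll_name() {
    t=$(printf '%s\n' "$1" | sed 's/[ "\|?\/\(\)]/-/g')
    d=$(printf '%s\n' "$2" | sed 's/\//-/g')
    printf '%s-%s.md\n' "$d" "$t"
}

# Hypothetical helper: rewrite header lines read from standard input.
convert_header() {
    sed -e 's/\/\*//' -e 's/\*\///' -e 's/Title: /# /'
}

jekyll_name 'Benchmark Thai word tokenizers' '2017/01/19'
# prints 2017-01-19-Benchmark-Thai-word-tokenizers.md

printf 'Title: Benchmark Thai word tokenizers\n' | convert_header
# prints # Benchmark Thai word tokenizers
```

The date prefix matters because Jekyll expects post file names to start with the date, followed by the title.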
If I outlive GitHub, I can just generate this site and host it somewhere else.