Scala with Text Files and ElasticSearch
I seriously don’t know why I keep doing this to myself. I know learning new things is something I need to do, but why Scala? I’m perfectly happy writing Python all day long. It’s straightforward and concise, no boilerplate, no re-inventing the wheel. I’ve written pipelines that crunch hundreds of TBs of data in Python, so all the snotty people who complain about Python not being fast enough or whatever can go hang out with this cow, looks like he could use a friend. This is something I’ve been meaning to do for a while: use Scala to read some text file(s) and store the data somewhere with some client. I chose ElasticSearch. I really just wanted practice doing something simple like reading files, and I was curious about how good the Scala clients are for popular tools.
More musings on Scala.
Starting with the end in mind. I have to say it’s been a few weeks since I decided that I needed to do penance and write Scala, so I was easily frustrated taking on this simple little task. My problem is that when I read books on Scala, or read Scala code online, I can tell it’s just different from what I’m used to. It’s obvious the approach to solving problems and writing code in Scala comes with a different mindset. Immutable values, throwing away thoughts of looping through lists… it’s all so different for someone writing Python all day. (Code available on GitHub)
Scala isn’t that hard when I write it like I would Python 🙂. What can I say? I’m just a creature of habit. Writing some object that extends App and has a bunch of defined functions just sitting around seems to be how all my Scala turns out.
I did turn over a new leaf or two going through this exercise though. (Read a text file and store the sentences into ElasticSearch for later searching.)
What I’m starting to love about Scala.
I didn’t think I would ever say that about Scala the first few times I wrote it. Just figuring out how to add dependencies to my .sbt file had me sweating bullets for a day. Anyways, not that this is particular to Scala, but I have found the chaining together of iterables to be quite satisfying. For example:
paragraphs.map(f => f.split('.')).map(f => f.mkString.replace("\n", "")).toList
I don’t know, there is just something nice about that. Seems concise and to the point. I feel like it’s an idiom I just don’t see used in other languages. I’ve also come to love statically typing everything. I mean, I always type hint my Python code, but it feels different in Scala. I’m honestly always thinking twice as hard about inputs and outputs in my Scala code. I think about it way more than when I’m writing a Python function, where it’s like an afterthought for me. I just enjoy the clarity of saying this function takes an ElasticSearch client response hit.
def pull_hit_sentence(hit: SearchHit)
It just feels like there is less ambiguity in my Scala code than what I’m used to in Python.
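To make the point about inputs and outputs concrete, here is a minimal sketch of the same idea with the return type spelled out as well. The `sentenceOf` name is hypothetical, and a plain `Map[String, AnyRef]` stands in for what `hit.sourceAsMap` hands back, so this runs without an ElasticSearch client.

```scala
object HitSentence {
  // Hypothetical helper: `source` stands in for `hit.sourceAsMap`.
  // Returning Option[String] makes the "field might be missing" case part of the signature.
  def sentenceOf(source: Map[String, AnyRef]): Option[String] =
    source.get("sentence").map(_.toString)
}
```

Anyone calling this has to deal with the `None` case up front, which is the kind of thing I would probably forget about in a Python function.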
Moving on… Scala with a text file in ElasticSearch.
I’m just going to code dump here. Probably some of the worst Scala ever written, but you know, gotta start somewhere. A few things of note that I learned from writing this code.
- I have to search way harder to find a good client for Scala than I do for Python.
- Sometimes the verbosity and boilerplate I have to write in Scala to use some tools absolutely drives me crazy.
- I need to learn about async/concurrent Scala, but I think I will probably die first.
- I find my Scala functions to be smaller and more concise than what I would probably write in Python.
- I find it funny that somehow Java crap always finds its way into Scala code.
- What the crap is a Scala case class again? Besides not having to specify the new keyword?
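On that last question, a short sketch of what a case class gives you beyond skipping `new`: value-based equality, a `copy` method, and pattern matching on the fields. The `Book` shape matches the one used in the code below; the values and the `CaseClassDemo` name are just for illustration.

```scala
object CaseClassDemo {
  // Same shape as the Book case class used in this post.
  case class Book(title: String, author: String, file_number: Int, publish_year: Int, file_uri: String)

  // No `new` needed: the compiler generates a companion `apply`.
  val a = Book("Confessions", "St. Augustine", 1, 1200, "confessions.txt")
  val b = Book("Confessions", "St. Augustine", 1, 1200, "confessions.txt")

  // Equality compares field values, not object identity.
  val sameValue: Boolean = a == b

  // `copy` builds a new immutable instance with selected fields changed.
  val renamed: Book = a.copy(title = "City of God")

  // Fields can be pulled apart with pattern matching.
  val label: String = a match {
    case Book(title, author, _, year, _) => s"$title by $author ($year)"
  }
}
```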
I found elastic4s to be the easiest Scala ElasticSearch client to use.
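Since adding dependencies to the .sbt file was the part that had me sweating, here is a rough sketch of what the elastic4s entries might look like. The version number is an assumption on my part (picked to line up with the 7.9.0 ElasticSearch image), so check Maven Central for the release that matches your cluster.

```scala
// build.sbt (sketch only; the version below is an assumption, verify it before use)
val elastic4sVersion = "7.9.0"

libraryDependencies ++= Seq(
  // Core DSL: createIndex, indexInto, search, ...
  "com.sksamuel.elastic4s" %% "elastic4s-core" % elastic4sVersion,
  // HTTP client module that provides JavaClient
  "com.sksamuel.elastic4s" %% "elastic4s-client-esjava" % elastic4sVersion
)
```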
I ran a local Docker ElasticSearch image, thanks to this smart dude, by running:
docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.9.0
import com.sksamuel.elastic4s.http.JavaClient
import com.sksamuel.elastic4s.{ElasticClient, ElasticProperties}
import com.sksamuel.elastic4s.requests.common.RefreshPolicy
import com.sksamuel.elastic4s.ElasticDsl._
import com.sksamuel.elastic4s.requests.searches.SearchHit
import scala.io.Source

case class Book(title: String, author: String, file_number: Int, publish_year: Int, file_uri: String)

object theos extends App {
  val props = ElasticProperties("http://127.0.0.1:9200")
  val client = ElasticClient(JavaClient(props))

  // Create the index that will hold one document per chunk of text.
  def create_elastic_sentence_index(client: ElasticClient) = {
    client.execute {
      createIndex("sentence")
    }.await
  }

  // Index a single piece of text along with the book's metadata.
  def write_sentence_to_es(client: ElasticClient, book: Book, sentence: String): Unit = {
    client.execute {
      indexInto("sentence").fields(
        "title" -> book.title,
        "author" -> book.author,
        "year" -> book.publish_year,
        "sentence" -> sentence
      ).refresh(RefreshPolicy.Immediate)
    }.await
  }

  // Run a query-string search against the "sentence" index.
  def search_keyword(client: ElasticClient, keyword: String) = {
    val resp = client.execute {
      search("sentence").query(keyword)
    }.await
    resp
  }

  def breakdown_text(book: Book): List[String] = {
    val paragraphs = Source.fromFile(book.file_uri).mkString.split("\\n\\n") // split book into paragraphs
    val sentences = paragraphs.map(f => f.split('.')).map(f => f.mkString.replace("\n", "")).toList
    sentences
  }

  // Print the stored text of a single search hit.
  def pull_hit_sentence(hit: SearchHit) = {
    val sentence: String = hit.sourceAsMap("sentence").toString
    println(sentence)
  }

  val b = Book("Confessions", "St. Augustine", 1, 1200, "src/main/scala/com.theos/confessions.txt")
  val ss = breakdown_text(book = b)
  create_elastic_sentence_index(client)
  for (s <- ss) {
    write_sentence_to_es(client, b, s)
  }
  val resp = search_keyword(client, "faith")
  val hits = resp.result.hits.hits.toList
  hits.foreach(pull_hit_sentence) // foreach, since we only want the println side effect
  client.close()
}
I don’t really have much to say about my code other than some stuff I mentioned earlier.
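On the async point from earlier: as far as I can tell, `client.execute` hands back a `Future` and every `.await` in my code blocks until it finishes, so the sentences get written one at a time. Below is a minimal sketch of the non-blocking shape using only the standard library. The `indexSentence` function is a hypothetical stand-in for the real indexing call (it just returns the sentence length), so this is the pattern, not elastic4s itself.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object AsyncSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for an indexing request; a real one would call the client.
  def indexSentence(sentence: String): Future[Int] =
    Future(sentence.length)

  // Start every request and collect the results into a single Future.
  def indexAll(sentences: List[String]): Future[List[Int]] =
    Future.traverse(sentences)(indexSentence)

  // Block once at the very end instead of once per sentence.
  def run(sentences: List[String]): List[Int] =
    Await.result(indexAll(sentences), 10.seconds)
}
```

`Future.traverse` keeps the results in the same order as the input list, even though the work itself may finish out of order.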
I’ve enjoyed learning to not write for loops and stick to stringing methods together, it just seems to make a lot of sense to code this way.
def breakdown_text(book: Book): List[String] = {
  val paragraphs = Source.fromFile(book.file_uri).mkString.split("\\n\\n") // split book into paragraphs
  val sentences = paragraphs.map(f => f.split('.')).map(f => f.mkString.replace("\n", "")).toList
  sentences
}
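One thing worth flagging about this function: `split('.')` followed by `mkString` glues each paragraph's pieces straight back together, so the list ends up with one entry per paragraph (minus the periods) rather than one per sentence. If one entry per sentence is what you want, a sketch using `flatMap` might look like the following. The `splitSentences` name is hypothetical, and I replace newlines with a space instead of an empty string so that words on adjacent lines don't get fused.

```scala
object SentenceSplit {
  // Hypothetical variant of breakdown_text that works on already-loaded text.
  def splitSentences(text: String): List[String] =
    text
      .split("\\n\\n")                  // break the text into paragraphs
      .toList
      .flatMap(_.split('.').toList)     // break each paragraph into sentences and flatten
      .map(_.replace("\n", " ").trim)   // join wrapped lines, drop stray whitespace
      .filter(_.nonEmpty)               // skip empty leftovers
}
```

Splitting on every period is still crude (it will cut "St. Augustine" in half), but it keeps the same chained style.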
It’s nice when Scala actually makes something easy to do for once…. like read a text file.
import scala.io.Source
Source.fromFile(book.file_uri)
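One caveat I'd add: `Source.fromFile` opens a file handle that the code above never closes. A small sketch of a safer version, assuming Scala 2.13 or newer where `scala.util.Using` is available (the `readAll` name is hypothetical):

```scala
import scala.io.Source
import scala.util.Using

object ReadFile {
  // Reads the whole file into a String and closes the Source afterwards,
  // even if reading throws.
  def readAll(path: String): String =
    Using.resource(Source.fromFile(path))(_.mkString)
}
```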
I wonder what ole’ St. Augustine would think of me battering his precious Confessions with my terrible Scala?