Main entry

This account was created so I could read authors I find interesting. And occasionally write down various interesting things of my own.

Prefix Tree vs Hash Table

Prefix trees are known as an alternative implementation of dictionaries/maps. Used this way, they sometimes significantly outperform hash tables. In particular, a prefix tree works well when the key space consists of thousands of not-very-long strings. I saw this in action in smart order router code at a big sell-side firm, where the keys were US stock tickers; there a prefix tree replaced a custom-made hash map in C++ code. I also recently built a similar thing for a video streaming startup, where the keys were cable channel codes; there the prefix tree serves as a replacement for the built-in map in Go.

Performance considerations aside, it is interesting that prefix tree dictionaries can do something hash table dictionaries cannot. Consider the following problem. We are given a programmer's resume as input, and we want to tag the resume with all the standard software development skills we know of. The skills are listed in some dictionary. An obvious solution is to split the resume into individual words (tokenize it), then look up things like "JavaScript", "Python" and "SQL" in our dictionary. This approach quickly becomes problematic once we consider skills like ".NET" and "data integration". With these two in the dictionary we can no longer use the blank (whitespace character) and the dot (punctuation character) as separators, so tokenization breaks down. And if we have no tokens, what are we going to hash? A hash table based implementation is stuck. A prefix tree can handle this scenario without tokenization: we "feed" the input string (the resume) to the tree character by character, starting a new walk at the root for every position, and keep a set of all the nodes reached/matched so far. Obviously, something like this requires adding an unconventional method to the class; the usual get/set key methods are not enough.
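To make the idea concrete, here is a minimal sketch of that "unconventional method" in Go. All the names (node, insert, matchAll) are mine, invented for illustration, not from either of the codebases mentioned above:

```go
package main

import "fmt"

// node is one prefix-tree node.
type node struct {
	children map[byte]*node
	skill    string // non-empty iff this node ends a dictionary entry
}

func newNode() *node { return &node{children: map[byte]*node{}} }

// insert adds one skill string to the tree, one character per level.
func insert(root *node, skill string) {
	cur := root
	for i := 0; i < len(skill); i++ {
		next, ok := cur.children[skill[i]]
		if !ok {
			next = newNode()
			cur.children[skill[i]] = next
		}
		cur = next
	}
	cur.skill = skill
}

// matchAll feeds text to the tree character by character, starting a fresh
// walk at the root at every position, and keeps the set of nodes reached
// so far. No tokenization anywhere. (For brevity it ignores word boundaries,
// so "SQL" would also match inside "MySQL" — a real tagger would have to
// deal with that.)
func matchAll(root *node, text string) []string {
	var found []string
	var active []*node
	for i := 0; i < len(text); i++ {
		active = append(active, root) // a new walk starts at this position
		var survivors []*node
		for _, n := range active {
			if child, ok := n.children[text[i]]; ok {
				if child.skill != "" {
					found = append(found, child.skill)
				}
				survivors = append(survivors, child)
			}
		}
		active = survivors
	}
	return found
}

func main() {
	root := newNode()
	for _, s := range []string{".NET", "SQL", "data integration"} {
		insert(root, s)
	}
	fmt.Println(matchAll(root, "Did data integration in .NET; wrote SQL."))
	// prints: [data integration .NET SQL]
}
```

Note that both ".NET" and "data integration" are found even though no separator could have tokenized them out.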

Some Really Knowledgeable Person (nponeccop) told me that such a solution would be relatively slow (O(n^2) in the worst case, I think) and that the Aho–Corasick algorithm is the right way to solve the problem for real (in production). Still, a prefix tree can totally be used in a coding interview (one hour or less), where Aho–Corasick could be mentioned but is probably too heavy to implement.

This was inspired by a post and its discussion (in Russian), here: https://jakobz.livejournal.com/268624.html. Special thanks to jakobz and nponeccop.

Fun with Go channels

This post is about non-obvious, cool things we can do with Go channels. I want the material to be readable by my Java, C#, Python and C++ friends (you know who you are ;) so the first half of the post is a basic presentation of Go channels. Also, the Go code I show here is not as idiomatic as it would have to be for production code in a Go shop; again, I am trying to make specific points and keep the whole thing accessible to non-Go programmers. Bring your Java (or whatever) and your multithreading knowledge (this one is a must) and you should be all right. My Go friends can safely skip the first half or so and jump to the second.

One of the main selling points of Go is its concurrency features. You can easily run functions asynchronously like this:
package main

import "fmt"

func foo(j int) {
	fmt.Printf("running #%v\n", j)
}

func main() {
	for i := 0; i < 5; i++ {
		go foo(i) // foo is a regular func that we just choose
		// to run in async mode here with the go keyword
	}
	// Caveat: main does not wait for the goroutines, so it may exit
	// before any of them gets to print anything.
}

The Go standard library has a mutex, a thread-safe map and other conventional multithreading goodies. Still, the idiomatic means of synchronization built into the language is channels. Here’s an example of foo actually “returning” something to the caller:
package main

import "fmt"

var ch = make(chan int) // Make the channel: chan keyword.

func foo(j int) {
	fmt.Printf("running #%v\n", j)
	ch <- j // Send on the channel
}

func main() {
	for i := 0; i < 5; i++ { // This stays the same as before.
		go foo(i)
	}
	for j := 0; j < 5; j++ {
		res := <- ch // Receive all the results, in
		// no particular/deterministic order
		fmt.Printf("result %v\n", res)
	}
}
It is intuitively clear how to send and receive things on a channel. Both send and receive are blocking operations by default. Specifically, a sender blocks until there is a receiver “on the other side” ready to read from the channel, and a receiver blocks until there is a value to read.

We can selectively and explicitly make channel operations non-blocking. This also allows for some interesting and non-obvious applications. Let's change the problem for now: we no longer care about “returning” things from async functions. The new feature we want instead is to limit the number of those async functions (they are called “goroutines” in Go) running at the same time.
package main

import "fmt"

var ch = make(chan bool, 3) // This is a buffered channel; 3 is the buffer size.

func foo(j int) {
	fmt.Printf("running #%v\n", j)
	<-ch // Ignore the value read from the channel. The point is to “unload” one value from the buffer.
	// Actual value does not matter here: send and receive are only used for synchronization.
}

func main() {
	for i := 0; i < 5; i++ {
		ch <- true // First 3 send operations on the channel are non-blocking b/c of the buffer size.
		// So the first 3 go routines (see below) are started right away.
		// The 4th send operation blocks until a value is received from the channel by one of the
		// first three go routines. foo is coded so that it finishes right after receive operation.
		// So only 3 go routines get to run at the same time: one of them must receive from the channel
		// before another could be run. 
		go foo(i)
	}
	// Note that main still does not wait for the last goroutines to finish;
	// waiting for them is exactly what we tackle next.
}

There is nothing truly original in this post so far: you can google all of the above. Now we are getting to the part that is not readily available. Let’s work with the last code snippet and add one more requirement: we want to wait for all the goroutines we started to finish. Most of the "real life" code I've seen achieves this with a sync.WaitGroup object. WaitGroup is another useful multithreading/synchronization type in the Go standard library, with methods like Add, Done and Wait. However, introducing a second synchronization object here is really unnecessary: we can finish the job using only the channel we started with.
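For reference, the conventional sync.WaitGroup version (my sketch, not code from any of the projects mentioned) looks roughly like this — note the second synchronization object alongside the channel:

```go
package main

import (
	"fmt"
	"sync"
)

var ch = make(chan bool, 3) // the same concurrency-limiting buffered channel
var wg sync.WaitGroup       // the extra synchronization object

func foo(j int) {
	defer wg.Done() // signal completion to the WaitGroup
	fmt.Printf("running #%v\n", j)
	<-ch
}

func main() {
	for i := 0; i < 5; i++ {
		ch <- true
		wg.Add(1) // register one more goroutine to wait for
		go foo(i)
	}
	wg.Wait() // block until every goroutine has called Done
}
```

This works fine; the point of what follows is that the WaitGroup is not actually needed.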
package main

import "fmt"

var ch = make(chan bool, 3) // The same buffered channel.

func foo(j int) {
	fmt.Printf("running #%v\n", j)
	<-ch // Same as before.
}

func main() {
	for i := 0; i < 5; i++ {
		ch <- true // Same as before.
		go foo(i)
	} // This loop sends 5 values to the channel. The foo goroutines receive 5 values.
	// So if we waited “long enough” the channel would become empty.
	// Let's wait "just enough" instead!

	for j := 0; j < 3; j++ { // 3 is the buffer size, AKA the channel capacity.
		ch <- true
	}
	// Each foo goroutine receives exactly one of the values sent so far; after that, no receivers are left.
	// So the only way for all 3 sends in this second loop to complete is to completely fill the channel buffer.
	// For that to happen, every goroutine must have received from the channel (its last action before returning).
	// Thus, by the time the 2nd loop is done, all the goroutines must be done too!
}
That’s it for today. Here’s the link to a more idiomatic (and runnable "in browser") implementation of the last example. It has a bunch of extra diagnostic print statements; they hopefully make it easier to understand what’s going on.

Bonus problem: try adding one more requirement. Let’s make those goroutines “return” some result (say, a boolean) in addition to everything we’ve done so far.

From Captain Obvious

In a (Python) interview, a candidate is asked to consider a collection of values and tell whether all of the values are the same. The answer given was to populate a set with those values (a one-liner in Python), then check its length. Why is that not a great idea, and what should we do instead (maybe)?
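My own take on the "maybe" (a sketch in Go, the language of the rest of this post, even though the interview was in Python): building a set always scans the whole collection and allocates O(n) extra memory, while a direct comparison can bail out at the first mismatch:

```go
package main

import "fmt"

// allSame reports whether every value in xs equals the first one.
// Unlike the set-based answer, it allocates nothing and returns at
// the first mismatch instead of always scanning the whole input.
func allSame(xs []int) bool {
	if len(xs) == 0 {
		return true // vacuously true; worth clarifying with the interviewer
	}
	for _, x := range xs[1:] {
		if x != xs[0] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allSame([]int{7, 7, 7})) // true
	fmt.Println(allSame([]int{7, 7, 8})) // false
}
```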