Ruby | Ruby_On_Rails | Programming

This is a fairly simple problem, but I keep looking it up and arriving at a similar solution, so I thought I would post this quick trick here.

In Ruby it's really easy to get the unique elements of a collection with Array#uniq, or, if you have Rails / ActiveRecord and it's a database collection, with .distinct.
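For reference, here is a minimal sketch of the unique case in plain Ruby. The `names` array is made-up sample data, not something from a real project.

```ruby
# Hypothetical sample data for illustration
names = ["alice", "bob", "alice", "carol", "bob"]

# Array#uniq returns a new array that keeps the first occurrence
# of each value, in the original order
unique_names = names.uniq

puts unique_names.size # three distinct names remain
```

The duplicates are simply thrown away here, which is exactly the information I want to keep when I am investigating data.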

I have found that when I am doing data analysis (looking at what has happened over time, events, non-relational data, or even relationships between models when it's relational data), I tend to want to look at duplicate instances of data to compare them and see what is happening.

I keep coming back to this Stack Overflow question, but I find the solutions a bit too complex for a quick one-off script.

My solution is to build a hash of counts and then filter it down to what I'm looking for.

This other Stack Overflow question is a good starting point for figuring out how to create a hash of counts for the elements in a collection.

Here is my code which can be modified as needed:

# Default value of 0 so that += 1 works for keys not seen yet
counts = Hash.new(0)

# counts[element] can be changed to be anything which
# uniquely identifies the element and is supported by the Hash
# e.g. counts[element.uuid]
collection.each { |element| counts[element] += 1 }

# Once you have generated your count of occurrences
# in a collection you can run a sanity check
collection.uniq.size == counts.size

# Getting only non-unique elements
# Structure: { "elementName" => 3 }
# with `to_a`: [ [ "elementName", 3 ] ]
duplicates = counts.to_a.select do |count|
  # count is an [element, occurrences] pair
  count.last > 1
end
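To see the steps above working together, here is a small end-to-end sketch in plain Ruby. The `collection` array is hypothetical sample data, and the filter uses `Hash#select` directly instead of going through `to_a`, which is just one possible variation.

```ruby
# Hypothetical sample data for illustration
collection = ["a", "b", "a", "c", "b", "a"]

# Default value of 0 lets us increment keys we have not seen yet
counts = Hash.new(0)
collection.each { |element| counts[element] += 1 }

# Sanity check: there should be one key per distinct element
raise "count mismatch" unless collection.uniq.size == counts.size

# Keep only the entries that occur more than once.
# "a" appears three times and "b" twice, so both are kept; "c" is dropped.
duplicates = counts.select { |_element, count| count > 1 }

duplicates.each do |element, count|
  puts "#{element} appears #{count} times"
end
```

On Ruby 2.7 or newer, `collection.tally` builds the same counts hash in a single call, so the first two steps can be replaced with it if your Ruby version allows.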

There are issues with this strategy. The main one is that everything is done in memory, so if the dataset you are working with is very large you will run into memory and processing issues. I mainly use this for one-off checks locally rather than on a production system.
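One way to soften the memory problem, assuming the data can be read one record at a time (for example, one value per line in a file), is to count while streaming rather than loading the whole collection first. This is only a sketch: `duplicate_lines` is a hypothetical helper name, and the counts hash still grows with the number of distinct values.

```ruby
require "stringio"

# Hypothetical helper: reads any IO-like object line by line, so only
# the counts hash (not the full dataset) is held in memory
def duplicate_lines(io)
  counts = Hash.new(0)
  io.each_line { |line| counts[line.chomp] += 1 }
  counts.select { |_line, count| count > 1 }
end

# StringIO stands in for a large file here; an opened File object
# responds to each_line in the same way
sample = StringIO.new("a\nb\na\nc\n")
result = duplicate_lines(sample)

result.each { |value, count| puts "#{value} appears #{count} times" }
```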
