Oct 14: Ping Li: Probabilistic hashing for similarity searching and machine learning on large datasets in high dimensions
More recently, we realized that (b-bit) minwise hashing can not only be used for similarity matching but also for machine learning. Applying logistic regression and SVMs to large datasets faces numerous practical challenges. As datasets become larger and larger, they take too long to load and may not fit in memory. Training and testing time can become an issue. Error analysis and exploratory data analysis are rarely performed on large datasets because it is too painful to run lots of what-if scenarios and explore lots of high-order interactions (pairwise, 3-way, etc.). The proposed method has been applied to two large datasets: a “smaller” dataset (24GB in 16M dimensions) and a “larger” dataset (200GB in 1B dimensions). Using a single desktop computer, the proposed method takes 3 seconds to train an SVM for the smaller dataset and 30 seconds for the larger dataset.
UPCOMING TALKS: http://www.clsp.jhu.edu/
SUBSCRIBE: http://www.clsp.jhu.edu/