Better Machine Learning Demands Better Data Labeling – Datanami

(everything possible/Shutterstock)

Money can’t buy you happiness (although you can reportedly lease it for a while). It definitely cannot buy you love. And the rumor is money also cannot buy you large troves of labeled data that are ready to be plugged into your particular AI use case, much to the chagrin of former Apple product manager Ivan Lee.

“I spent hundreds of millions of dollars at Apple gathering labeled data,” Lee said. “And even with its resources, we were still using spreadsheets.”

It wasn’t much different at Yahoo. There, Lee helped the company develop the sorts of AI applications that one might expect of a Web giant. But getting the data labeled in the manner required to train the AI was, again, not a pretty sight.

“I’ve been a product manager for AI for the past decade,” the Stanford graduate told Datanami in a recent interview. “What I recognized across all these companies was AI is very powerful. But in order to make it happen, behind the scenes, how the sausage was made was we had to get a lot of training data.”

Armed with this insight, Lee founded Datasaur to develop software to automate the data labeling process. Of course, data labeling is an inherently human endeavor, at least at the beginning of an AI project. Toward the middle or the end of a project, machine learning itself can be used to label data automatically, and synthetic data can also be generated.

Lee’s main goal with the Datasaur software was to streamline the interaction of human data labelers and to guide them through the process of creating the highest quality training data at the lowest cost. Because the software targets power users who label data all day, it includes function keys that accelerate the process, among other capabilities befitting a dedicated data labeling system.

Datasaur helps customers with data labeling for NLP

But along the way, several other goals popped up for Datasaur, including the need to remove bias. Getting multiple eyeballs on a given piece of text (for NLP use cases) or an image (for computer vision use cases) helps to alleviate that bias. The software also provides project management capabilities that clearly spell out labeling guidelines, so that labeling standards continue to be met over time.
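To make the "multiple eyeballs" idea concrete, here is a minimal sketch of one common way to combine labels from several annotators: take a majority vote per item and flag items where agreement is low. This is an illustration only, not a description of how Datasaur works; the function name `aggregate_labels`, the `min_agreement` threshold, and the sample labels are all assumptions made for the example.

```python
from collections import Counter


def aggregate_labels(annotations, min_agreement=0.6):
    """Majority-vote labels per item and flag low-agreement items.

    annotations: dict mapping an item id to the list of labels that
    different annotators assigned to that item (hypothetical format).
    """
    results = {}
    for item_id, labels in annotations.items():
        if not labels:
            raise ValueError(f"no labels for item {item_id!r}")
        # Most frequent label; ties resolve to the label seen first.
        top_label, top_count = Counter(labels).most_common(1)[0]
        # Share of annotators who chose the winning label.
        agreement = top_count / len(labels)
        results[item_id] = {
            "label": top_label,
            "agreement": agreement,
            # Low agreement suggests the item or the guidelines are ambiguous.
            "needs_review": agreement < min_agreement,
        }
    return results


if __name__ == "__main__":
    sample = {
        "doc-1": ["PERSON", "PERSON", "ORG"],
        "doc-2": ["positive", "negative"],
    }
    for item_id, result in aggregate_labels(sample).items():
        print(item_id, result)
```

Items flagged for review are the ones where annotators disagreed most, which is often a sign that the labeling guidelines need to be spelled out more clearly.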

The subjective nature of data labeling is one of the things that makes the discipline so fraught with pitfalls. For example, when Lee was at Apple, he was asked to come up with a way to automatically label a piece of …….

Source: https://www.datanami.com/2021/12/02/better-machine-learning-demands-better-data-labeling/