Privacy and security are central concerns when working with AI, and with generative AI in particular. The issues are ethical, moral, and legal, as some of the recent data privacy lawsuits demonstrate.
One school of thought holds that open source is safer and more private. OpenAI's models, for example, are trained in part on publicly available sources such as GitHub and Wikipedia.
Some of the privacy concerns arise from how these solutions use the data that users submit in prompts. There have been cases in which users received other users' private data, such as phone numbers, in response to their prompts. Commercial solutions, then, are not necessarily safer.
When you tune your own model, you own the data lifecycle and can control which data goes in. It is important to ensure that your data pipelines filter out private information and do not emit it. This call center demo example uses an open source PII (personally identifiable information) recognizer to filter out private data patterns, such as names, social security numbers, and credit card numbers. This ensures that private data will not appear in the results when tuning or prompting the model.
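To make the filtering step concrete, here is a minimal, hypothetical sketch of a PII scrubber for call transcripts. It uses hand-rolled regular expressions for a few common identifier formats purely for illustration; the demo itself relies on a dedicated open source PII recognizer, which handles far more patterns (including names) than a regex approach can.

```python
import re

# Hypothetical regex patterns for a few structured PII formats.
# A real pipeline would use a purpose-built PII recognizer instead,
# since names and free-form identifiers cannot be caught by regexes.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each matched PII pattern with a typed placeholder,
    so redacted transcripts can safely enter a tuning dataset."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

transcript = "Caller: my number is (555) 123-4567 and my SSN is 123-45-6789."
print(redact_pii(transcript))
# → Caller: my number is <PHONE> and my SSN is <SSN>.
```

Running the filter before tuning or prompting means the model never sees the raw identifiers, so it cannot later reproduce them in a response.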