Intellectual property concerns are central to the evolving legal framework surrounding generative AI and machine learning. Training large language models involves significant volumes of copyrighted text, images, audio, and video data, which raises unresolved questions about direct and vicarious copyright infringement, as well as the potential reproduction of expressive elements in outputs.
Moreover, to provide accurate and current answers to user prompts, many AI systems rely on up-to-date information retrieved from the web through an automated, bot-based process known as retrieval-augmented generation (“RAG”) and grounding. RAG and grounding work together by scraping the web for current information and anchoring an AI system’s output to that retrieved information, so that responses remain accurate, verifiable, and contextually relevant instead of resting solely on pre-trained model data.
In addition to concerns about copyrighted materials, the use of confidential, proprietary, healthcare, personal, and other sensitive information introduces further challenges related to data provenance, lawful sourcing, and the extent to which businesses can control how their information is ingested, transformed, or reused within AI systems. As AI tools increasingly rely on both licensed and unlicensed datasets to produce accurate and up-to-date outputs, these combined IP and confidentiality risks shape the contractual protections, governance practices, and compliance obligations necessary for responsible deployment.
Courts are currently assessing whether the ingestion of copyrighted material constitutes fair use and whether AI-generated outputs should be treated as derivative works, especially when they reflect stylistic or substantive similarities to protected source material. Some early U.S. federal court decisions have blessed AI model training on copyrighted content as fair use in creating “spectacularly transformative” technology, while cases regarding infringing AI outputs are still in the early stages. These disputes reflect a broader concern among creators, publishers, and rights holders that unlicensed training diminishes the value of their work and undermines established licensing markets.