When I created my first dataset on Hugging Face, I had to choose a license. There were so many options that I got confused, so I tried to figure out which license type is most appropriate for my dataset.
Creative Commons (CC) licenses allow content creators to share their work and specify exactly how others may use it under copyright law. There are six main types of CC licenses:
CC BY Attribution. You are free to distribute, remix, tweak and build upon the work as long as the original author is credited.
CC BY-SA Attribution ShareAlike. You are free to use the work as with CC BY, as long as the new work is licensed under the same terms.
CC BY-NC Attribution NonCommercial. You can adapt, remix, etc. the original work, but you cannot do so commercially.
CC BY-NC-SA Attribution NonCommercial ShareAlike. You can adapt, remix, etc. non-commercially, but you must share the new work under the same terms.
CC BY-ND Attribution NoDerivatives. You can redistribute the work, but it must remain unchanged and the original author must be credited.
CC BY-NC-ND Attribution NonCommercial NoDerivatives. You can share the work, but you must not change it in any way or share it commercially.
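The six combinations above follow from three yes/no questions, which can be sketched in Python. This is only an illustration: the function names are hypothetical, and the identifier format in the second function (lowercase name plus a "4.0" version suffix, written into the dataset card's front matter) is an assumption that should be checked against the license list on the Hugging Face Hub.

```python
def choose_cc_license(allow_commercial: bool, allow_derivatives: bool,
                      share_alike: bool = False) -> str:
    """Return the name of the CC license matching the given permissions."""
    if share_alike and not allow_derivatives:
        # ShareAlike applies to adaptations, so it needs derivatives allowed.
        raise ValueError("ShareAlike requires derivatives to be allowed")
    parts = ["BY"]  # all six licenses require attribution
    if not allow_commercial:
        parts.append("NC")
    if not allow_derivatives:
        parts.append("ND")
    elif share_alike:
        parts.append("SA")
    return "CC " + "-".join(parts)


def dataset_card_front_matter(cc_name: str, version: str = "4.0") -> str:
    """Build YAML front matter for a dataset card (assumed identifier format)."""
    identifier = cc_name.replace(" ", "-").lower() + "-" + version
    return f"---\nlicense: {identifier}\n---\n"


if __name__ == "__main__":
    name = choose_cc_license(allow_commercial=False, allow_derivatives=True,
                             share_alike=True)
    print(name)  # CC BY-NC-SA
    print(dataset_card_front_matter(name))
```

For example, a dataset that may be adapted only non-commercially and only under the same terms maps to CC BY-NC-SA.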
Eclipse Public License is a free, open-source license mainly used by the Eclipse Foundation, allowing contributors to modify and distribute software while requiring that modified source code remains under the EPL.
Berkeley Software Distribution licenses are a family of permissive free software licenses, imposing minimal restrictions on the use and distribution of covered software.
PostgreSQL License is a liberal, OSI-approved open-source license that allows free, unrestricted use, modification, and distribution of the software for any purpose, including commercial products, without fees or required written agreements.
The GNU General Public License is a free, copyleft license that allows you to share and change all versions of a program, to make sure it remains free software for all its users. The General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things. The GNU license family includes:
GNU General Public License is used by most GNU programs, and by more than half of all free software packages.
GNU Affero General Public License is based on the GNU GPL, but has an additional term to allow users who interact with the licensed software over a network to receive the source for that program.
GNU Lesser General Public License is used by a few (not by any means all) GNU libraries.
GNU Free Documentation License is a form of copyleft intended for use on a manual, textbook or other document to assure everyone the effective freedom to copy and redistribute it, with or without modifications, either commercially or non-commercially.
Llama Community License Agreement is a custom, non-open-source license created by Meta Platforms to govern the use, modification, and distribution of its Llama series of models, allowing research and commercial use for most users. The license enforces specific restrictions, such as a user-base cap, that protect Meta's commercial interests and prevent the use of Llama to improve competing models.
Microsoft Public License (Ms-PL) is an open-source software license written by Microsoft. It permits use, modification, and distribution and includes a patent grant, but it gives no right to use contributors' trademarks.
Intel Research Use License Agreement
Intel Research Use License Agreement is a legal contract governing the access, download, and use of Intel-provided materials (software, datasets, tools) specifically for internal, non-commercial research and development purposes. It restricts distribution, prohibits reverse-engineering, and often mandates that materials be used only with Intel components.
Mozilla Public License (MPL)
Mozilla Public License is a free, open-source license that allows free use, modification, and distribution of MPL-licensed software, including within proprietary projects, as long as changes to the MPL-covered files themselves remain under the MPL.
Personal data is the most unique and intimate aspect of an individual, and it must be safeguarded. Data containing personal records must be taken into account at all times, from development to evaluation of the product. Records should be encrypted, pseudonymized or anonymized wherever possible. This applies to internal, customer, and third-party data.
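As one possible illustration of pseudonymization, the sketch below replaces selected fields with a keyed hash (HMAC-SHA256) using only the Python standard library. The function names, the field names, and the example key are hypothetical; in practice the key would have to be stored separately from the data, and whether this approach is sufficient depends on the applicable legal requirements.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for a value using HMAC-SHA256."""
    # Normalize so that "Alice@Example.com " and "alice@example.com" match.
    normalized = value.strip().lower()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def pseudonymize_record(record: dict, fields: set, secret_key: bytes) -> dict:
    """Return a copy of the record with the listed fields pseudonymized."""
    return {
        key: pseudonymize(str(value), secret_key) if key in fields else value
        for key, value in record.items()
    }


if __name__ == "__main__":
    key = b"example-key-keep-the-real-one-secret"  # hypothetical key
    row = {"email": "alice@example.com", "country": "DE"}  # made-up record
    print(pseudonymize_record(row, {"email"}, key))
```

The same input and key always give the same pseudonym, so records can still be joined or counted per person without exposing the original value.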
Be ready to report on any user in the database at short notice: the number of records, access levels, and removal procedures.
Review permissions for data storage and databases (especially sensitive ones) regularly, ideally every week. Give access only to specific project team members or those with direct assignments.
Triple-check the recipients and the content of what you are sending. Limit the address list to the main participants only, so that no data about other individuals is shared with unintended recipients.
Limit the use of personal records in reports or slides. If you have to use them, make them unrecognizable (for example, by blurring or by replacing exact values with number ranges).
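Replacing exact values with number ranges can be sketched as follows. The function name and the default bucket width of 10 are assumptions chosen for illustration; the right width depends on how identifiable the values are in a given dataset.

```python
def to_range(value: int, width: int = 10) -> str:
    """Replace an exact non-negative number with the range that contains it."""
    if width <= 0:
        raise ValueError("width must be positive")
    if value < 0:
        raise ValueError("value must be non-negative")
    low = (value // width) * width  # lower bound of the bucket
    return f"{low}-{low + width - 1}"


if __name__ == "__main__":
    print(to_range(37))  # 30-39
```

A slide would then show an age of 37 as "30-39", which is usually enough for the audience and reveals less about the individual.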
Data is powerful when used legally.