Last Updated on 2024-07-22 by Clay
Introduction
HuggingFace Model Hub is now a widely recognized and essential open-source platform for everyone. Every day, countless individuals and organizations upload their latest trained models (covering text, images, speech, and other domains) to the platform. It is fair to say that anyone working in AI-related fields browses the HuggingFace website frequently.
We can download models from the HuggingFace Hub to local storage and use them conveniently (but be careful: not every model is licensed for commercial use!). So, how can we download models elegantly?
There are various tutorials available online for downloading models, some even using multi-core, segmented downloads with resume support. However, the simplest methods remain git clone and snapshot_download(). These methods are especially useful if you don't want the download to also leave an original copy under ~/.cache/huggingface/.
If we discover other more efficient or simpler download methods in the future, we will document and share them in a separate article.
git clone
git clone is indeed one of the simplest methods, but its biggest drawback is that, in order to track file versions, it also downloads a hidden .git folder. This hidden folder is as large as the model itself: if you download a 14GB model, you'll find that it occupies 28GB of space!
This is quite wasteful in terms of disk space, so if you don't need frequent updates, it's recommended to simply delete the .git folder.
The download method is as follows: if the repo_id you want to download is openai-community/gpt2, remember to prepend HuggingFace's base URL:
git clone https://huggingface.co/openai-community/gpt2
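If you don't need the version history, you can reclaim the duplicated space right after cloning (using the gpt2 directory created by the command above):
du -sh gpt2        # check the size before and after
rm -rf gpt2/.git   # drop the hidden history folder; note you lose the ability to git pull updates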
snapshot_download()
If you want to use snapshot_download(), first you need to install the huggingface_hub package:
pip3 install huggingface_hub
And then:
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id=repo_id,
    local_dir=local_dir,
    local_dir_use_symlinks=False,
)
You can use the above call to download a model.
Here is a brief introduction to a few of the most important parameters:
- repo_id (str): The name of the repository to download, for example, openai-community/gpt2.
- local_dir (str): The location to save the model.
- local_dir_use_symlinks (bool): Whether to use symbolic links pointing to the original files in the cache.
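For instance, a concrete call that saves openai-community/gpt2 into a local ./gpt2 folder might look like this (the target directory is just an example):
from huggingface_hub import snapshot_download

# Download the repository files directly into ./gpt2 instead of symlinking into the cache.
snapshot_download(
    repo_id="openai-community/gpt2",
    local_dir="./gpt2",
    local_dir_use_symlinks=False,
)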
There's a small pitfall here: if the model you want to download already exists locally under ~/.cache/huggingface/, then even if you set local_dir_use_symlinks to False, it will still use symbolic links pointing to the model under ~/.cache/huggingface/.
Therefore, it is recommended to check if there is already a cache on your local machine before downloading the model. If it's difficult to verify and the cache data is not important, simply delete the folders under ~/.cache/huggingface/
before downloading. This will not affect the program's execution; at most, you will need to re-download small models when necessary.
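If you prefer to check before deleting anything, recent versions of huggingface_hub provide a scan_cache_dir() helper; here is a minimal sketch:
from huggingface_hub import scan_cache_dir

# Summarize every repository currently stored in the local HuggingFace cache.
cache_info = scan_cache_dir()
for repo in cache_info.repos:
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.2f} GB")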
Below is a simple script for this workflow.
You don't need to create the model_hub folder yourself; the program will create it automatically.
amazon/chronos-t5-tiny
prajjwal1/bert-tiny
NousResearch/Hermes-2-Pro-Mistral-7B
The above is an example of model_list.txt: simply list the repo IDs of the models you want to download, one per line.
import argparse
import os

from huggingface_hub import snapshot_download


def main() -> None:
    # Arguments
    parser = argparse.ArgumentParser(description="Download snapshots from HuggingFace Model Hub")
    parser.add_argument("--download_file", type=str, required=True, help="The list file of repository IDs")
    parser.add_argument("--local_dir", type=str, default="./", help="Directory to save the downloaded snapshots")

    # Parsing
    args = parser.parse_args()

    # Make sure `local_dir` exists
    os.makedirs(args.local_dir, exist_ok=True)

    # Get all repo IDs
    with open(args.download_file, "r") as f:
        repo_ids = [repo_id for repo_id in f.read().splitlines() if repo_id.strip()]

    # Download
    for repo_id in repo_ids:
        local_dir = os.path.join(args.local_dir, repo_id.replace("/", "--"))
        if os.path.isdir(local_dir):
            print(f"{repo_id} already exists, skipping.")
            continue

        snapshot_download(
            repo_id=repo_id,
            local_dir=local_dir,
            local_dir_use_symlinks=False,
        )
        print(f"\n{repo_id} is finished.\n")


if __name__ == "__main__":
    main()
This is the code for downloading the models. It automatically converts the / in each repo ID to --, and it skips any model that already exists in the target directory to avoid duplicate downloads.
#!/bin/bash
time python3 download.py \
--download_file ./model_list.txt \
--local_dir ./model_hub/
Finally, here is the wrapper script for automatic execution, with the input file and download directory already set.
Don't forget to run chmod +x download.sh to make it executable. After that, you can run it with ./download.sh.
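Assuming the example model_list.txt above, the download directory should end up looking roughly like this (the folder names come from the replace("/", "--") step in the script):
model_hub/
├── amazon--chronos-t5-tiny/
├── prajjwal1--bert-tiny/
└── NousResearch--Hermes-2-Pro-Mistral-7B/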
Read More
- [Solved] HuggingFace Transformers Model Return "'ValueError: too many values to unpack (expected 2)', upon training a Bert binary classification model"
- [Solved] huggingface/tokenizers: The current process just got forked. after parallelism has already been used. Disabling parallelism to avoid deadlocks