Downloading Models and Data from Hugging Face and ModelScope
1. Downloading models from Hugging Face
Method 1: use --cache-dir to specify the download directory
huggingface-cli download --cache-dir /your/custom/path MODEL_OR_DATASET_NAME
Method 2: use --local-dir to specify the download directory
huggingface-cli download Qwen/Qwen3-1.7B --local-dir /data2/model_llm/Qwen3-1.7B
huggingface-cli download Qwen/Qwen2-0.5B-Instruct --local-dir /data2/model_llm/Qwen2-0.5B-Instruct
huggingface-cli download Qwen/Qwen2-1.5B-Instruct --local-dir /data2/model_llm/Qwen2-1.5B-Instruct
huggingface-cli download Qwen/Qwen2.5-3B --local-dir /data2/model_llm/Qwen2.5-3B
2. Downloading datasets from Hugging Face
2.1 Using the huggingface_hub library
(1) Download an entire repository to a specified directory
from huggingface_hub import snapshot_download

# Download a model/dataset to a custom directory
snapshot_download(
    repo_id="MODEL_OR_DATASET_ID",
    repo_type="dataset",              # set this when downloading a dataset
    cache_dir="/your/cache/path",     # custom cache directory
    local_dir="/your/target/path",    # (newer versions) save directly to this directory instead of the cache
    local_dir_use_symlinks=False,     # copy files instead of creating symlinks
    resume_download=True              # resume interrupted downloads
)
Example:
# Download an entire repository to a specified directory
from huggingface_hub import snapshot_download

output_path = "/data2/model_llm/datasets/ChnSentiCorp"
snapshot_download(
    repo_id="seamew/ChnSentiCorp",
    repo_type="dataset",              # this repository is a dataset
    # cache_dir="/your/cache/path",   # optional: custom cache directory
    local_dir=output_path,            # (newer versions) save directly to this directory instead of the cache
    local_dir_use_symlinks=False      # copy files instead of creating symlinks
)
(2) Download a single file to a specified path
from huggingface_hub import hf_hub_download

file_path = hf_hub_download(
    repo_id="MODEL_OR_DATASET_ID",
    filename="path/within/repo (e.g. config.json)",
    cache_dir="/your/cache/path",     # custom cache directory
    local_dir="/your/target/path",    # (newer versions) save directly to the target directory
    force_filename="custom_name.txt"  # optional: rename the downloaded file
)
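For example, a minimal sketch that fetches only the config file of one of the models listed above (the repo id and target path are illustrative):
from huggingface_hub import hf_hub_download

# Download just config.json from the Qwen3-1.7B repo (illustrative example)
file_path = hf_hub_download(
    repo_id="Qwen/Qwen3-1.7B",
    filename="config.json",
    local_dir="/data2/model_llm/Qwen3-1.7B"
)
print(file_path)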
2.2 Using the datasets library (datasets only)
from datasets import load_dataset

# Download a dataset to a specified directory
dataset = load_dataset(
    "DATASET_ID",
    cache_dir="/your/cache/path",        # custom cache directory
    download_mode="force_redownload"     # optional: force a fresh download
)
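For example, a minimal sketch downloading the GSM8K dataset (also used in the git-clone example below) into a custom cache directory; the config name "main" and the path are illustrative assumptions:
from datasets import load_dataset

# Download GSM8K ("main" config) into a custom cache directory (illustrative path)
dataset = load_dataset(
    "openai/gsm8k",
    "main",
    cache_dir="/data2/model_llm/datasets/gsm8k"
)
print(dataset)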
2.3 Cloning to a specified directory with Git
git lfs install  # make sure Git LFS is installed
git clone https://huggingface.co/MODEL_OR_DATASET_ID /your/target/path
2.4 Direct download with wget or curl
# Download a single file to a target path (replace TOKEN, the path, and the URL)
wget --header="Authorization: Bearer YOUR_TOKEN" \
     -O "/your/target/path/custom_filename" \
     https://huggingface.co/MODEL_OR_DATASET_ID/resolve/main/path/to/file
Examples:
git clone https://huggingface.co/datasets/openai/gsm8k /home/py_workspace/datasets/openai_gsm8k
git lfs clone https://huggingface.co/datasets/seamew/ChnSentiCorp
git lfs clone https://huggingface.co/datasets/Karsh-CAI/btfChinese-DPO-small
3. Searching for models and datasets on Hugging Face
3.1 Searching on the Hugging Face website
The most direct way is to browse the Models and Datasets pages on the official site and filter with the search box:
Type a keyword into the search box (e.g. bert or text-classification), then refine the results with the filters on the left (task type, framework, language, etc.).
3.2 Using the huggingface_hub Python library
To search for models or datasets from code, use the list_models() and list_datasets() functions of the huggingface_hub library.
Install the library
pip install huggingface_hub
Search for models
from huggingface_hub import list_models

# Search for BERT-related models
models = list_models(search="bert", limit=10)  # limit caps the number of results
for model in models:
    print(f"Model ID: {model.id}")
    print(f"Task: {model.pipeline_tag}")   # task type (e.g. text-classification)
    print(f"Downloads: {model.downloads}") # download count
    print("---")
Search for datasets
from huggingface_hub import list_datasets

# Search for text-classification datasets
datasets = list_datasets(search="text classification", limit=5)
for dataset in datasets:
    print(f"Dataset ID: {dataset.id}")
    print(f"Tags: {dataset.tags}")         # tags (task categories, language, license, etc.)
    print(f"Downloads: {dataset.downloads}")
    print("---")
Example output:
$ python search_hf_2.py
Model ID: google-bert/bert-base-uncased
Task: fill-mask
Downloads: 48311761
---
Model ID: google-bert/bert-base-chinese
Task: fill-mask
Downloads: 3470669
---
$ python search_hf_data.py
Dataset ID: Karavet/ILUR-news-text-classification-corpus
Tags: ['task_categories:text-classification', 'multilinguality:monolingual', 'language:hy', 'license:apache-2.0', 'size_categories:100K<n<1M', 'format:text', 'modality:text', 'library:datasets', 'library:mlcroissant', 'region:us']
Downloads: 320
---
Dataset ID: jakeazcona/short-text-labeled-emotion-classification
Tags: ['size_categories:10K<n<100K', 'format:parquet', 'modality:text', 'library:datasets', 'library:pandas', 'library:mlcroissant', 'library:polars', 'region:us']
Downloads: 40
---
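Beyond plain keyword search, list_models() also accepts additional filters. A sketch that narrows by task and sorts by download count (parameter names follow the huggingface_hub documentation and are used here as an assumption, not taken from the original examples):
from huggingface_hub import list_models

# Filter by task and sort by downloads (illustrative sketch)
models = list_models(task="text-classification", search="chinese", sort="downloads", limit=5)
for model in models:
    print(model.id, model.downloads)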
3.3 Using the Hugging Face REST API
import requests

# Search for models
url = "https://huggingface.co/api/models"
params = {"search": "bert", "limit": 5}
response = requests.get(url, params=params).json()
for model in response:
    print(model["modelId"])
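The same pattern should work for datasets; a sketch assuming the Hub's /api/datasets endpoint and its "id" field:
import requests

# Search for datasets via the REST API (assumed endpoint: /api/datasets)
url = "https://huggingface.co/api/datasets"
params = {"search": "text classification", "limit": 5}
response = requests.get(url, params=params).json()
for dataset in response:
    print(dataset["id"])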
4. Downloading models from ModelScope
https://modelscope.cn/docs/intro/environment-setup
from modelscope import snapshot_download

# Download a model directly to a custom directory (not only the cache)
model_dir = snapshot_download(
    "damo/nlp_structbert_sentence-similarity_chinese-base",
    cache_dir="/your/custom/cache/path",   # optional: change the cache directory
    local_dir="/your/target/directory",    # download directly to the target directory
    local_dir_use_symlinks=False,          # disable symlinks (copy files directly)
)
Parameter notes:
cache_dir: change the cache directory (default: ~/.cache/modelscope/hub)
local_dir: download directly to the target directory (recommended)
local_dir_use_symlinks=False: store files directly instead of creating symlinks
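A concrete sketch mirroring the Hugging Face examples above; the model id and target path are illustrative assumptions:
from modelscope import snapshot_download

# Download Qwen2.5-3B from ModelScope directly to a target directory (illustrative)
model_dir = snapshot_download(
    "Qwen/Qwen2.5-3B",
    local_dir="/data2/model_llm/Qwen2.5-3B",
)
print(model_dir)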
5. Downloading datasets from ModelScope
pip install datasets==2.8.0 modelscope==1.4.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install pyarrow==11.0.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
from modelscope.msdatasets import MsDataset

# Download a dataset to a specified directory
dataset = MsDataset.load(
    "clue/afqmc",
    subset_name="default",
    split="train",
    cache_dir="/your/custom/dataset/path",  # change the dataset cache directory
)
print(dataset)
import os
from modelscope.msdatasets import MsDataset

# Create the directory if it does not exist
cache_dir = "/data2/data/refcoco"
os.makedirs(cache_dir, exist_ok=True)

try:
    # Download and load the dataset
    dataset = MsDataset.load(
        "swift/refcoco",
        subset_name="default",
        split="train",
        cache_dir=cache_dir,
        download_mode="reuse_dataset_if_exists"
    )
    # Print basic information
    print(f"Dataset type: {type(dataset)}")
    print(f"Number of samples: {len(dataset)}")
    if len(dataset) > 0:
        print("First sample:", dataset[0])
except Exception as e:
    print("Failed to load dataset:", e)
# Import the required libraries
from modelscope.msdatasets import MsDataset
import os
import pandas as pd

MAX_DATA_NUMBER = 500

# Create the cache directory if it does not exist
cache_dir = "/data2/data/coco_2014_caption"
os.makedirs(cache_dir, exist_ok=True)

# Check whether the output directory already exists
if not os.path.exists('coco_2014_caption'):
    # Download the COCO 2014 image-caption dataset from ModelScope
    ds = MsDataset.load(
        'modelscope/coco_2014_caption',
        subset_name='coco_2014_caption',
        split='train',
        cache_dir=cache_dir,
        trust_remote_code=True)
    print(len(ds))

    # Cap the number of images to process
    total = min(MAX_DATA_NUMBER, len(ds))

    # Create the directory for saving images
    os.makedirs('coco_2014_caption', exist_ok=True)

    # Lists for image paths and captions
    image_paths = []
    captions = []

    for i in range(total):
        # Read one sample
        item = ds[i]
        image_id = item['image_id']
        caption = item['caption']
        image = item['image']

        # Save the image and record its path
        image_path = os.path.abspath(f'coco_2014_caption/{image_id}.jpg')
        image.save(image_path)

        # Append the path and caption to the lists
        image_paths.append(image_path)
        captions.append(caption)

        # Print progress every 50 images
        if (i + 1) % 50 == 0:
            print(f'Processing {i+1}/{total} images ({(i+1)/total*100:.1f}%)')

    # Collect image paths and captions into a DataFrame
    df = pd.DataFrame({
        'image_path': image_paths,
        'caption': captions
    })

    # Save the data as a CSV file
    df.to_csv('./coco-2024-dataset.csv', index=False)

    print(f'Done: processed {total} images')
else:
    print('The coco_2014_caption directory already exists; skipping data processing')
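After the script finishes, the exported CSV can be read back for a quick sanity check (a minimal sketch; the path matches the script above):
import pandas as pd

# Load the exported image-path/caption table and inspect it
df = pd.read_csv('./coco-2024-dataset.csv')
print(df.head())
print(f"Total rows: {len(df)}")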
6. Using the ModelScope command-line tool
Download a model (to the default cache directory)
modelscope download --model damo/nlp_structbert_sentence-similarity_chinese-base
Example 1:
modelscope download --dataset 'Tongyi-DataEngine/SA1B-Dense-Caption' --include 'data/train-000*' --local_dir '/data2/data/SA1B-Dense-Caption'
Example 2:
modelscope download --dataset AI-ModelScope/LaTeX_OCR --local_dir /data2/data/LaTeX_OCR

7. Environment required by ModelScope
datasets==3.6.0 is required; version 2.19.0 raises errors.
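A quick way to confirm the installed versions before running the examples above (a minimal sketch):
import datasets
import modelscope

# Print the installed versions; datasets should be 3.6.0 for the examples above
print(datasets.__version__)
print(modelscope.__version__)
The conda environment used here, for reference: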
name: llmtuner
channels:
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=5.1=1_gnu
- bzip2=1.0.8=h5eee18b_6
- ca-certificates=2025.2.25=h06a4308_0
- ld_impl_linux-64=2.40=h12ee557_0
- libffi=3.4.4=h6a678d5_1
- libgcc-ng=11.2.0=h1234567_1
- libgomp=11.2.0=h1234567_1
- libstdcxx-ng=11.2.0=h1234567_1
- libuuid=1.41.5=h5eee18b_0
- ncurses=6.4=h6a678d5_0
- openssl=3.0.16=h5eee18b_0
- pip=25.1=pyhc872135_2
- python=3.10.16=he870216_1
- readline=8.2=h5eee18b_0
- sqlite=3.45.3=h5eee18b_0
- tk=8.6.14=h39e8969_0
- xz=5.6.4=h5eee18b_1
- zlib=1.2.13=h5eee18b_1
- pip:
- absl-py==2.3.1
- accelerate==1.8.1
- addict==2.4.0
- aiofiles==24.1.0
- aiohappyeyeballs==2.6.1
- aiohttp==3.12.13
- aiosignal==1.4.0
- airportsdata==20250622
- aliyun-python-sdk-core==2.16.0
- aliyun-python-sdk-kms==2.16.5
- annotated-types==0.7.0
- antlr4-python3-runtime==4.9.3
- anyio==4.9.0
- astor==0.8.1
- asttokens==3.0.0
- async-timeout==5.0.1
- attrdict==2.0.1
- attrs==25.3.0
- auto-gptq==0.7.1
- autoawq==0.2.9
- av==15.1.0
- binpacking==1.5.2
- bitsandbytes==0.46.1
- blake3==1.0.5
- boto3==1.38.36
- botocore==1.38.36
- brotli==1.2.0
- cachetools==6.1.0
- cbor2==5.7.0
- certifi==2025.6.15
- cffi==2.0.0
- charset-normalizer==3.4.2
- click==8.1.8
- clip==1.0
- cloudpickle==3.1.1
- colorama==0.4.6
- colorlog==6.10.1
- comm==0.2.2
- compressed-tensors==0.10.2
- contourpy==1.3.2
- cpm-kernels==1.0.11
- crcmod==1.7
- cryptography==46.0.3
- cupy-cuda12x==13.4.1
- cycler==0.12.1
- dacite==1.9.2
- datasets==3.6.0
- debugpy==1.8.14
- decorator==5.2.1
- decord==0.6.0
- deepseek-sft-63==0.1
- depyf==0.19.0
- dill==0.3.8
- diskcache==5.6.3
- distro==1.9.0
- dnspython==2.7.0
- docstring-parser==0.17.0
- dotenv==0.9.9
- einops==0.8.1
- email-validator==2.2.0
- et-xmlfile==2.0.0
- evalscope==1.2.0
- evaluate==0.4.6
- exceptiongroup==1.2.2
- executing==2.2.0
- fastapi==0.115.14
- fastapi-cli==0.0.7
- fastrlock==0.8.3
- ffmpy==1.0.0
- filelock==3.18.0
- fire==0.7.1
- fonttools==4.57.0
- frozenlist==1.7.0
- fsspec==2025.3.0
- ftfy==6.3.1
- func-timeout==4.3.5
- future==1.0.0
- fuzzywuzzy==0.18.0
- gekko==1.3.0
- gguf==0.16.2
- google-auth==2.43.0
- google-genai==1.50.1
- googleapis-common-protos==1.70.0
- gradio==5.49.1
- gradio-client==1.13.3
- groovy==0.1.2
- grpcio==1.73.1
- h11==0.14.0
- h5py==3.15.1
- hf-xet==1.1.5
- httpcore==1.0.8
- httptools==0.6.4
- httpx==0.28.1
- huggingface-hub==0.34.4
- human-eval==1.0.3
- idna==3.10
- imageio==2.37.2
- immutabledict==4.2.2
- importlib-metadata==8.7.0
- iniconfig==2.1.0
- interegular==0.3.3
- ipdb==0.13.13
- ipykernel==6.29.5
- ipython==8.37.0
- jedi==0.19.2
- jieba==0.42.1
- jinja2==3.1.6
- jiter==0.9.0
- jmespath==0.10.0
- joblib==1.5.2
- json-repair==0.53.0
- json5==0.12.1
- jsonlines==4.0.0
- jsonschema==4.24.0
- jsonschema-specifications==2025.4.1
- jupyter-client==8.6.3
- jupyter-core==5.8.1
- kiwisolver==1.4.8
- langdetect==1.0.9
- lark==1.2.2
- latex2sympy2-extended==1.10.2
- levenshtein==0.27.3
- llama-cpp-python==0.3.9
- llguidance==0.7.30
- llmcompressor==0.6.0
- llvmlite==0.44.0
- lm-format-enforcer==0.10.11
- loguru==0.7.3
- lxml==6.0.2
- markdown==3.10
- markdown-it-py==3.0.0
- markupsafe==3.0.2
- matplotlib==3.10.1
- matplotlib-inline==0.1.7
- mdurl==0.1.2
- mistral-common==1.8.4
- mmengine-lite==0.10.7
- modelscope==1.31.0
- mpmath==1.3.0
- ms-opencompass==0.1.6
- ms-swift==3.10.1
- ms-vlmeval==0.0.18
- msgpack==1.1.1
- msgspec==0.19.0
- multidict==6.6.3
- multiprocess==0.70.16
- nest-asyncio==1.6.0
- networkx==3.4.2
- ninja==1.11.1.4
- nltk==3.9.2
- numba==0.61.2
- numpy==1.26.4
- nvidia-cublas-cu12==12.6.4.1
- nvidia-cuda-cupti-cu12==12.6.80
- nvidia-cuda-nvrtc-cu12==12.6.77
- nvidia-cuda-runtime-cu12==12.6.77
- nvidia-cudnn-cu12==9.5.1.17
- nvidia-cufft-cu12==11.3.0.4
- nvidia-cufile-cu12==1.11.1.6
- nvidia-curand-cu12==10.3.7.77
- nvidia-cusolver-cu12==11.7.1.2
- nvidia-cusparse-cu12==12.5.4.2
- nvidia-cusparselt-cu12==0.6.3
- nvidia-ml-py==12.575.51
- nvidia-nccl-cu12==2.26.2
- nvidia-nvjitlink-cu12==12.6.85
- nvidia-nvtx-cu12==12.6.77
- ollama==0.4.8
- omegaconf==2.3.0
- openai==1.101.0
- openai-harmony==0.0.4
- opencc==1.1.9
- opencv-python==4.11.0.86
- opencv-python-headless==4.11.0.86
- openpyxl==3.1.5
- opentelemetry-api==1.34.1
- opentelemetry-exporter-otlp==1.34.1
- opentelemetry-exporter-otlp-proto-common==1.34.1
- opentelemetry-exporter-otlp-proto-grpc==1.34.1
- opentelemetry-exporter-otlp-proto-http==1.34.1
- opentelemetry-proto==1.34.1
- opentelemetry-sdk==1.34.1
- opentelemetry-semantic-conventions==0.55b1
- opentelemetry-semantic-conventions-ai==0.4.9
- optimum==1.26.1
- orjson==3.11.4
- oss2==2.19.1
- outlines==0.1.11
- outlines-core==0.2.10
- overrides==7.7.0
- packaging==25.0
- pandas==2.3.1
- parso==0.8.4
- partial-json-parser==0.2.1.1.post6
- peft==0.16.0
- pexpect==4.9.0
- pillow==11.2.1
- platformdirs==4.3.8
- pluggy==1.5.0
- portalocker==3.2.0
- prettytable==3.16.0
- prometheus-client==0.20.0
- prometheus-fastapi-instrumentator==7.1.0
- prompt-toolkit==3.0.51
- propcache==0.3.2
- protobuf==5.29.5
- psutil==7.0.0
- ptyprocess==0.7.0
- pure-eval==0.2.3
- py-cpuinfo==9.0.0
- pyarrow==20.0.0
- pyasn1==0.6.1
- pyasn1-modules==0.4.2
- pybase64==1.4.2
- pycountry==24.6.1
- pycparser==2.22
- pycryptodome==3.23.0
- pydantic==2.11.7
- pydantic-core==2.33.2
- pydantic-extra-types==2.10.5
- pydub==0.25.1
- pyecharts==2.0.8
- pygments==2.19.1
- pylatexenc==2.10
- pynvml==12.0.0
- pyparsing==3.2.3
- pypinyin==0.55.0
- pytest==8.3.5
- python-dateutil==2.9.0.post0
- python-dotenv==1.1.1
- python-json-logger==3.3.0
- python-levenshtein==0.27.3
- python-multipart==0.0.20
- pytz==2025.2
- pyyaml==6.0.2
- pyzmq==27.0.0
- qwen-vl-utils==0.0.8
- qwen2p5-vl-sft-63==0.1
- rank-bm25==0.2.2
- rapidfuzz==3.14.3
- ray==2.48.0
- referencing==0.36.2
- regex==2024.11.6
- requests==2.32.4
- rich==14.0.0
- rich-toolkit==0.14.7
- rouge==1.0.1
- rouge-chinese==1.0.3
- rouge-score==0.1.2
- rpds-py==0.25.1
- rsa==4.9.1
- ruff==0.14.5
- s3transfer==0.13.0
- sacrebleu==2.5.1
- safehttpx==0.1.7
- safetensors==0.5.3
- scikit-learn==1.7.2
- scipy==1.15.3
- seaborn==0.13.2
- semantic-version==2.10.0
- sentence-transformers==5.1.2
- sentencepiece==0.2.0
- setproctitle==1.3.6
- setuptools==80.9.0
- shellingham==1.5.4
- simplejson==3.20.1
- six==1.17.0
- sniffio==1.3.1
- sortedcontainers==2.4.0
- soundfile==0.13.1
- soxr==0.5.0.post1
- stack-data==0.6.3
- starlette==0.46.2
- sty==1.0.6
- swankit==0.2.3
- swanlab==0.6.3
- sympy==1.14.0
- tabulate==0.9.0
- tenacity==9.1.2
- tensorboard==2.20.0
- tensorboard-data-server==0.7.2
- termcolor==3.1.0
- threadpoolctl==3.6.0
- tiktoken==0.9.0
- timeout-decorator==0.5.0
- timm==1.0.22
- tokenizers==0.22.1
- tomli==2.2.1
- tomlkit==0.13.3
- torch==2.7.1
- torchaudio==2.7.1
- torchvision==0.22.1
- tornado==6.5.1
- tqdm==4.67.1
- traitlets==5.14.3
- transformers==4.57.1
- transformers-stream-generator==0.0.5
- triton==3.3.1
- trl==0.24.0
- typer==0.15.2
- typing-extensions==4.14.1
- typing-inspection==0.4.0
- tzdata==2025.2
- urllib3==2.5.0
- uvicorn==0.35.0
- uvloop==0.21.0
- validators==0.35.0
- vl-learning==0.1
- vllm==0.10.1.1
- watchfiles==1.1.0
- wcwidth==0.2.13
- websockets==15.0.1
- werkzeug==3.1.3
- wget==3.2
- wheel==0.45.1
- word2number==1.1
- wrapt==1.17.2
- xformers==0.0.31
- xgrammar==0.1.21
- xlsxwriter==3.2.9
- xxhash==3.5.0
- yapf==0.43.0
- yarl==1.20.1
- zipp==3.23.0
- zstandard==0.23.0
prefix: /data2/anaconda3/envs/llmtuner