Inference error

#20
by gsasikiran - opened

When trying to deploy a dedicated Inference Endpoint for a token classification task, we get the error below.

ValueError: The checkpoint you are trying to load has model type gemma2 but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

Application startup failed. Exiting.

The full traceback is as follows:

  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1128, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 825, in __getitem__
    raise KeyError(key)
KeyError: 'gemma2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 732, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 608, in __aenter__
    await self._router.startup()
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 709, in startup
    await handler()
  File "/app/webservice_starlette.py", line 60, in some_startup_task
    inference_handler = get_inference_handler_either_custom_or_default_handler(HF_MODEL_DIR, task=HF_TASK)
  File "/app/huggingface_inference_toolkit/handler.py", line 54, in get_inference_handler_either_custom_or_default_handler
    return HuggingFaceHandler(model_dir=model_dir, task=task)
  File "/app/huggingface_inference_toolkit/handler.py", line 18, in __init__
    self.pipeline = get_pipeline(
  File "/app/huggingface_inference_toolkit/utils.py", line 276, in get_pipeline
    hf_******** = pipeline(
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/__init__.py", line 815, in pipeline
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1130, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `gemma2` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

try updating transformers version
transformers==4.42.3
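
To confirm whether an environment already meets the version suggested above, a small check like the following can help. It is only a sketch: `SUGGESTED_VERSION` and the helper names are made up for illustration, and 4.42.3 is taken from this thread rather than from any official minimum.

```python
from importlib import metadata

# Assumption: 4.42.3 is the version suggested in this thread, used here as the floor.
SUGGESTED_VERSION = "4.42.3"


def parse_version(text):
    """Turn '4.42.3' or '4.43.0.dev0' into a tuple of the leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_suggested_version(installed, required=SUGGESTED_VERSION):
    """True when the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


def installed_transformers_version():
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_transformers_version()
    if found is None:
        print("transformers is not installed")
    elif meets_suggested_version(found):
        print(found, "meets the suggested version")
    else:
        print(found, "is older than the suggested version")
```

If the reported version is older, the `KeyError: 'gemma2'` in the traceback is the expected symptom, since the config mapping has no entry for that model type.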

Wait for Inference Endpoints (IE) to update.

Same error. Any news, @gsasikiran?

@kargaranamir I tried deploying to Inference Endpoints again today and got a new error about GPU CUDA compatibility with FlashAttention and about sharding.

{"timestamp":"2024-07-15T07:13:41.525975Z","level":"INFO","fields":{"message":"Using default cuda graphs [1, 2, 4, 8, 16, 32]"},"target":"text_generation_launcher"}
2024/07/15 09:13:41
{"timestamp":"2024-07-15T07:13:41.525984Z","level":"INFO","fields":{"message":"Sharding model on 4 processes"},"target":"text_generation_launcher"}
2024/07/15 09:13:41
{"timestamp":"2024-07-15T07:13:41.526054Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/07/15 09:13:45
{"timestamp":"2024-07-15T07:13:45.044455Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
2024/07/15 09:13:45
{"timestamp":"2024-07-15T07:13:45.729310Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/07/15 09:13:45
{"timestamp":"2024-07-15T07:13:45.729466Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/07/15 09:13:45
{"timestamp":"2024-07-15T07:13:45.729479Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
2024/07/15 09:13:45
{"timestamp":"2024-07-15T07:13:45.729522Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
2024/07/15 09:13:45
{"timestamp":"2024-07-15T07:13:45.729773Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"shard-manager"}]}
2024/07/15 09:13:49
{"timestamp":"2024-07-15T07:13:49.732782Z","level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}
2024/07/15 09:13:49
{"timestamp":"2024-07-15T07:13:49.732782Z","level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}
2024/07/15 09:13:49
{"timestamp":"2024-07-15T07:13:49.732783Z","level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}
2024/07/15 09:13:49
{"timestamp":"2024-07-15T07:13:49.732783Z","level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}
2024/07/15 09:13:50
{"timestamp":"2024-07-15T07:13:50.184999Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 253, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n 
self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 217, in serve_inner\n model = get_model(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 610, in get_model\n raise NotImplementedError(\"sharded is not supported for AutoModel\")\nNotImplementedError: sharded is not supported for AutoModel\n"},"target":"text_generation_launcher"}
2024/07/15 09:13:50
{"timestamp":"2024-07-15T07:13:50.185479Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 253, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n 
self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 217, in serve_inner\n model = get_model(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 610, in get_model\n raise NotImplementedError(\"sharded is not supported for AutoModel\")\nNotImplementedError: sharded is not supported for AutoModel\n"},"target":"text_generation_launcher"}
2024/07/15 09:13:50
{"timestamp":"2024-07-15T07:13:50.185853Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 253, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n 
self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 217, in serve_inner\n model = get_model(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 610, in get_model\n raise NotImplementedError(\"sharded is not supported for AutoModel\")\nNotImplementedError: sharded is not supported for AutoModel\n"},"target":"text_generation_launcher"}
2024/07/15 09:13:50
{"timestamp":"2024-07-15T07:13:50.186664Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 253, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n 
self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 217, in serve_inner\n model = get_model(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 610, in get_model\n raise NotImplementedError(\"sharded is not supported for AutoModel\")\nNotImplementedError: sharded is not supported for AutoModel\n"},"target":"text_generation_launcher"}
2024/07/15 09:13:51
{"timestamp":"2024-07-15T07:13:51.034332Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 253, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 649, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 217, in serve_inner\n model = get_model(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 610, in get_model\n raise NotImplementedError(\"sharded is not supported for AutoModel\")\n\nNotImplementedError: sharded is not supported for AutoModel\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/07/15 09:13:51
{"timestamp":"2024-07-15T07:13:51.034444Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 253, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 649, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 217, in serve_inner\n model = get_model(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 610, in get_model\n raise NotImplementedError(\"sharded is not supported for AutoModel\")\n\nNotImplementedError: sharded is not supported for AutoModel\n"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
2024/07/15 09:13:51
{"timestamp":"2024-07-15T07:13:51.037358Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 253, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 649, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 217, in serve_inner\n model = get_model(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 610, in get_model\n raise NotImplementedError(\"sharded is not supported for AutoModel\")\n\nNotImplementedError: sharded is not supported for AutoModel\n"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
2024/07/15 09:13:51
{"timestamp":"2024-07-15T07:13:51.132849Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
2024/07/15 09:13:51
{"timestamp":"2024-07-15T07:13:51.132865Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
2024/07/15 09:13:51
{"timestamp":"2024-07-15T07:13:51.134410Z","level":"INFO","fields":{"message":"Terminating shard"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"shard-manager"}]}
2024/07/15 09:13:51
{"timestamp":"2024-07-15T07:13:51.134428Z","level":"INFO","fields":{"message":"Waiting for shard to gracefully shutdown"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"shard-manager"}]}
2024/07/15 09:13:51
{"timestamp":"2024-07-15T07:13:51.234558Z","level":"INFO","fields":{"message":"shard terminated"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"shard-manager"}]}
2024/07/15 09:13:51
Error: ShardCannotStart
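
Since the log above is long and repeats the same traceback once per shard, it can help to reduce it to the distinct warnings and errors. The sketch below assumes the log is available as a list of lines in the same JSON-per-line format; `summarize_tgi_log` is a hypothetical helper, and the sample lines are abridged from the log above.

```python
import json


def summarize_tgi_log(lines, levels=("WARN", "ERROR")):
    """Return unique (level, last message line) pairs in order of first appearance."""
    seen = []
    for raw in lines:
        raw = raw.strip()
        if not raw.startswith("{"):
            continue  # skip the plain timestamp lines between entries
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip entries that were split across lines when pasted
        if entry.get("level") not in levels:
            continue
        message = entry.get("fields", {}).get("message", "")
        non_empty = [part.strip() for part in message.splitlines() if part.strip()]
        if not non_empty:
            continue
        item = (entry["level"], non_empty[-1])
        if item not in seen:
            seen.append(item)
    return seen


# Abridged sample entries taken from the log in this thread.
sample = [
    "2024/07/15 09:13:49",
    r'{"level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}',
    r'{"level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nTraceback (most recent call last):\n\nNotImplementedError: sharded is not supported for AutoModel\n"},"target":"text_generation_launcher"}',
    r'{"level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}',
]

for level, text in summarize_tgi_log(sample):
    print(level, text)
```

Read this way, the Flash Attention V2 message is only a warning, while the shards actually fail on `NotImplementedError: sharded is not supported for AutoModel`, right after the launcher reports sharding the model on 4 processes. That suggests (untested here) that a single-GPU deployment, where no sharding is attempted, may get past this particular error.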

try updating transformers version
transformers==4.42.3

@saireddy How can I update it? Should I clone and then deploy?


Hi @gsasikiran, you can use `!pip install -U transformers`, or install a specific version with `!pip install transformers==4.42.3`.

@Renu11 how do I do it in IE?
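
One possible route in Inference Endpoints, offered as a sketch rather than a confirmed fix: Inference Endpoints can load a custom handler from a `handler.py` file in the model repository and install extra dependencies from a `requirements.txt` next to it. Assuming a `requirements.txt` containing the single line `transformers==4.42.3`, a minimal handler for token classification could look like this. Whether the pinned version actually replaces the one in the default container is an assumption to verify on your own endpoint.

```python
from typing import Any, Dict, List


class EndpointHandler:
    """Custom handler sketch for a token-classification endpoint."""

    def __init__(self, path: str = ""):
        # Imported lazily so the module can be inspected without transformers installed.
        from transformers import pipeline

        # `path` is the local directory of the model repository inside the endpoint.
        self.pipeline = pipeline("token-classification", model=path)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # Requests are expected as {"inputs": ..., "parameters": {...}}.
        inputs = data.get("inputs", data)
        parameters = data.get("parameters") or {}
        return self.pipeline(inputs, **parameters)
```

Both files would need to live in a repository you can write to, so in practice this means duplicating the model repository (or creating one that references the weights) and deploying the endpoint from that copy.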

ValueError: Trying to set a tensor of shape torch.Size([4096, 3584]) in "weight" (which has shape torch.Size([7340032, 1])), this look incorrect.

Not working with textgen web UI, even with everything updated.
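
A quick bit of arithmetic on the two shapes in that `ValueError` may help narrow it down. The shapes are copied from the error message; the interpretation in the note below is a guess, not something confirmed in this thread.

```python
from math import prod

checkpoint_shape = (4096, 3584)  # shape of the tensor being loaded
module_shape = (7340032, 1)      # shape of the "weight" parameter it is loaded into

checkpoint_elements = prod(checkpoint_shape)
module_elements = prod(module_shape)

print(checkpoint_elements)                    # number of values in the checkpoint tensor
print(checkpoint_elements / module_elements)  # ratio between the two element counts
```

The checkpoint tensor holds exactly twice as many elements as the target parameter. One possible reading is that the model was instantiated with 4-bit packed weights (two 4-bit values per byte, stored as a flat column) while the checkpoint being loaded is unquantized, or the other way round, so the loader and the checkpoint disagree about quantization.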
