IndexError: index out of range in self


I get this error when using the example code. The last line in the stack trace is this:
Lib\site-packages\torch\nn\functional.py", line 2267, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

The only thing I changed is that I used a longer input text, so I suspect the input is too long. How can I fix this? Can I set a maximum length somehow?

I was facing the same issue. For larger texts, I solved it by slicing the input into two pieces, summarizing each piece separately, merging the two summaries, and then summarizing the result one more time. The problem is that I think a lot of information was lost.
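
In case it helps, here is a minimal sketch of that split-summarize-merge approach. It assumes the Hugging Face transformers summarization pipeline and a reasonably long input; the model name is only a placeholder, not necessarily what was actually used:

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")  # placeholder model

def summarize_long(text):
    # Split the input roughly in half at a word boundary.
    words = text.split()
    mid = len(words) // 2
    halves = [" ".join(words[:mid]), " ".join(words[mid:])]

    # Summarize each half separately, then summarize the merged summaries once more.
    partial = [summarizer(h, truncation=True)[0]["summary_text"] for h in halves]
    return summarizer(" ".join(partial), truncation=True)[0]["summary_text"]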

That's right. I usually break my input text into chunks of at most 500 tokens to resolve this, using a helper like the one below.
# Assumes a Hugging Face tokenizer is already loaded, e.g.:
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")  # model is just an example

def chunk_text_with_context(text, context, max_tokens=500):
    words = text.split()
    chunks = []
    # Every chunk starts with the context string, so count its tokens up front.
    current_chunk = [context]
    current_length = len(tokenizer.encode(context, add_special_tokens=False))

    for word in words:
        word_length = len(tokenizer.encode(word, add_special_tokens=False))
        if current_length + word_length <= max_tokens:
            current_chunk.append(word)
            current_length += word_length
        else:
            # Chunk is full: flush it and start a new one with the context again.
            chunks.append(" ".join(current_chunk))
            current_chunk = [context, word]
            current_length = len(tokenizer.encode(context, add_special_tokens=False)) + word_length

    # Add the last chunk if there's any
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

The code above is for when you want to prepend a fixed context string to each chunk.
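
For example, assuming the function above plus a summarization pipeline (the file name, model, and context string below are placeholders), usage might look like this:

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")  # placeholder model

long_text = open("article.txt").read()  # placeholder input
chunks = chunk_text_with_context(long_text, context="Summarize this report:")

# Summarize each chunk independently, then join the partial summaries.
summaries = [summarizer(c, truncation=True)[0]["summary_text"] for c in chunks]
final_summary = " ".join(summaries)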
