I’ve not yet listened to the recording, but I suspect that’s down to bandwidth.
When in a Zoom meeting by yourself, there is only one audio/video stream for your Internet connection (and the upstream connections to the Zoom servers) to deal with. When in a conference with others, there’s far more data flying around.
The algorithms used for video calling will, typically, adjust the data rate when they start to see congestion. This means reducing the quality of the audio and/or video.
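To give a feel for that, here's a rough sketch in Python of a loss-based rate controller. It is not Zoom's actual algorithm (I don't know what they use internally); the function name `adjust_bitrate`, the 2% and 10% loss thresholds and the step sizes are all my own illustrative assumptions.

```python
def adjust_bitrate(current_kbps, loss_fraction, min_kbps=100, max_kbps=3000):
    """Hypothetical controller: back off on packet loss, probe upward otherwise.

    All thresholds and limits here are illustrative assumptions,
    not values taken from any real product.
    """
    if loss_fraction > 0.10:
        # Heavy loss suggests congestion: cut the rate in proportion to the loss.
        new_rate = current_kbps * (1 - 0.5 * loss_fraction)
    elif loss_fraction < 0.02:
        # Almost no loss: cautiously try a slightly higher rate.
        new_rate = current_kbps * 1.05
    else:
        # In between: hold steady.
        new_rate = current_kbps
    # Never go outside the range the encoder can sensibly use.
    return max(min_kbps, min(max_kbps, new_rate))
```

Call it once per feedback interval with the measured loss, and the encoder is told to use whatever rate comes back. A lower rate is what you hear as reduced audio quality.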
Music stresses these sorts of service more than normal speech because there are fewer gaps and more content to encode. That means more data.
Oh, and a dirty little secret of how the Internet works: incoming data generates, and requires, some outgoing data (and vice versa), so if you are on an asymmetric Internet service, congestion in your upstream can limit the amount of downstream data you can receive.
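As a back-of-envelope illustration, here's a small Python sketch that estimates how much upstream a download needs just for acknowledgements. It assumes TCP-style behaviour: 1500-byte packets, 40-byte acknowledgements and one acknowledgement per two packets received. Those figures and the function names are my assumptions, and real-time media sends its feedback differently, so treat the numbers as indicative only.

```python
def ack_upstream_kbps(downstream_kbps, packet_bytes=1500, ack_bytes=40,
                      packets_per_ack=2):
    """Estimate the upstream rate (kbit/s) consumed by acknowledgements.

    Assumes TCP-style ACKs; the default sizes are illustrative assumptions.
    """
    packets_per_sec = downstream_kbps * 1000 / 8 / packet_bytes
    acks_per_sec = packets_per_sec / packets_per_ack
    return acks_per_sec * ack_bytes * 8 / 1000


def max_downstream_kbps(free_upstream_kbps, packet_bytes=1500, ack_bytes=40,
                        packets_per_ack=2):
    """Invert the estimate: the downstream rate a given free upstream can sustain."""
    acks_per_sec = free_upstream_kbps * 1000 / 8 / ack_bytes
    packets_per_sec = acks_per_sec * packets_per_ack
    return packets_per_sec * packet_bytes * 8 / 1000
```

Under those assumptions a 30 Mbit/s download needs roughly 0.4 Mbit/s of upstream. If your own outgoing audio and video have already filled the upstream, those acknowledgements get delayed or dropped, and the incoming side slows down with them.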
The other thing is making sure everyone else on the call is either on headsets or muted. If they are not, this will cause echo. There are echo cancellation algorithms, but they are not perfect and are mostly designed to work with voice, not with music.
Cheers,
Keith