I’ve seen complaints pop up on Twitter that people are getting their accounts suspended over years-old tweets that happen to contain copyrighted music. So let’s say that, like me, you have a Twitter account over 10 years old and you want to go through your old tweets so you can pull any such video before Twitter does. How do you go about doing that?
Well here’s the thing: the UNIX command line is incredibly powerful if you know how to use it. In this post, I’ll show you how to use the bash shell on Linux or macOS to find those videos so that you can remove them.
The first thing you gotta do is download your entire Twitter archive. There are instructions on how to do that here. Once you put in the request, you’ll hear back from Twitter within a day when the download is ready. Expect the file to be rather large — in my case it was over 2 Gigs. Download that file and unzip it.
At the time of this writing, all of your media will be found in the folder data/tweet_media/, so cd into that directory and see how many files there are:
ls -l |wc -l
10085
Well, uh… that’s a lot of files. Let’s see how many of them are videos and if we can move that into a separate directory:
mkdir mp4
mv *.mp4 mp4/
cd mp4/
ls -l | wc -l
2270
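A small caveat on these counts: ls -l usually prints a leading "total" line, so piping it to wc -l tends to overcount by one. As a rough sketch of a more exact tally, here is a helper that counts files per extension with find. The name count_by_extension is made up for illustration and is not part of any tool:

```shell
# count_by_extension is a hypothetical helper, not a standard command.
# It tallies the regular files in a directory by extension, most common first.
count_by_extension() {
  local dir="${1:-.}"
  # -maxdepth 1 keeps find from descending into subdirectories.
  find "$dir" -maxdepth 1 -type f -name '*.*' \
    | sed 's/.*\.//' \
    | sort \
    | uniq -c \
    | sort -rn
}
```

Running count_by_extension data/tweet_media from the top of the unzipped archive should give one line per extension, each with its file count in front.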
Well, that’s a little less awful. Still a lot of files though. What if not all of those files have audio? Could I narrow it down even more?
Fortunately there is a command line app called ffmpeg which can convert video files as well as inspect their contents. The specific command to do this is ffmpeg -i. However, I don’t know what the output for a file with audio would look like, so I guess I gotta inspect every file and then go through the output after the fact. The UNIX command line makes this easy. Check this out:
for FILE in *.mp4; do ffmpeg -i "${FILE}" 2>&1; done | pv -l > output.txt
55.1k 0:01:19 [ 694 /s] [ <=>]
There’s a lot going on in there, so let’s walk through it:
- for FILE in *.mp4: This runs the commands between do and done once for each filename that matches the *.mp4 pattern, with $FILE set to that filename.
- 2>&1: This sends standard error to the same place as standard output, so that both can be piped into…
- pv: Pipe Viewer! This is a neat command line app that shows the progress of data flowing through it, so you know work is being done. The -l switch makes it count lines instead of bytes. From there, the text goes to…
- > output.txt: Where the output will be written. We are writing the results of this operation (as well as others) to a file, because inspecting a file containing the output is waaaaaay easier than having to run a time-consuming command repeatedly.
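To see why 2>&1 matters here: ffmpeg prints its report on standard error, and a pipe only carries standard output. Here is a minimal sketch of the difference, using a made-up function called noisy as a stand-in for ffmpeg:

```shell
# noisy is a hypothetical stand-in for a command that, like ffmpeg,
# writes its diagnostics to standard error.
noisy() {
  echo "this line goes to standard output"
  echo "this line goes to standard error" >&2
}

# Only standard output travels down the pipe. Standard error is thrown
# away here just to keep the terminal quiet.
lines_without_redirect() {
  noisy 2>/dev/null | wc -l
}

# 2>&1 points standard error at wherever standard output is going,
# so both lines reach wc.
lines_with_redirect() {
  noisy 2>&1 | wc -l
}
```

Calling lines_without_redirect reports one line, while lines_with_redirect reports two.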
As you can see, about 55,100 lines were written over the course of 79 seconds. Now that we have all of the details of every video, let’s search for strings that relate to audio:
cat output.txt | grep -i audio | wc -l
189
Fantastic! 189 matches, so let’s use head -n2 to inspect a couple, and we can see that it does look like audio data:
cat output.txt | grep -i audio |head -n2
Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 128 kb/s (default)
Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, mono, fltp, 127 kb/s (default)
So now that we know what to look for, we want to report just on files that have audio data — how do we do that? Let’s start by doing that for loop from before, but printing out just the filename (with echo) and searching for the string “Audio”:
for FILE in *.mp4; do echo "${FILE}"; ffmpeg -i "${FILE}" 2>&1 | grep Audio; done | pv -l > output-audio.txt
head -n6 output-audio.txt
10344303167420001-7M6TBTo3zGLZO0La.mp4
10344303167420002-7M6TBTo3zGLZO0La.mp4
10344303167420003-7M6TBTo3zGLZO0La.mp4
Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 128 kb/s (default)
10344303167420004-7M6TBTo3zGLZO0La.mp4
10344303167420005-7M6TBTo3zGLZO0La.mp4
As we can see from looking at the output file, we now have the name of every file, and some of those files have audio in them. So how do we get just the names of the files that have audio in them? Grep to the rescue:
cat output-audio.txt | grep -B 1 Audio | grep "\.mp4" > files.txt
The -B switch in grep tells grep not just to print out lines that contain the string “Audio”, but to also print out 1 line before each matching line, and we know from the previous command that the line in question contains the filename. The final grep will then extract just the filename. Let’s see how many files matched:
cat files.txt | wc -l
189
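If the grep -B trick feels a little indirect, the same pairing can be done by remembering the last filename seen and printing it when an Audio line turns up. This is just a sketch. It assumes the layout shown above, where each filename line is optionally followed by a line containing "Audio", and files_with_audio is a made-up name:

```shell
# files_with_audio is a hypothetical helper. It reads text shaped like
# output-audio.txt (a filename line, optionally followed by ffmpeg stream
# lines) and prints each filename once if an Audio line follows it.
files_with_audio() {
  awk '
    # Remember the most recent filename line.
    /\.mp4$/ { name = $0; next }
    # On an Audio line, print the remembered filename once, then forget it.
    /Audio/ && name != "" { print name; name = "" }
  ' "$@"
}
```

Something like files_with_audio output-audio.txt > files.txt would then stand in for the two grep commands.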
Not bad — we started with over 10,000 files and we narrowed down the ones we need to search to under 200! Now let’s build on what we already know about the UNIX command line to move those files into another directory:
mkdir audio
for FILE in $(cat files.txt); do mv "${FILE}" audio/; done
cd audio
ls -l | wc -l
189
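One thing to keep in mind: the for FILE in $(cat ...) pattern splits its input on whitespace. That is fine for filenames like the ones shown here, but it would break on names containing spaces. A more defensive sketch reads the list one line at a time. The name move_listed is hypothetical:

```shell
# move_listed is a hypothetical helper: it moves every file named in a list
# (one name per line) into a destination directory, creating it if needed.
move_listed() {
  local list="$1" dest="$2" name
  mkdir -p "$dest"
  # IFS= and -r keep leading spaces and backslashes in names intact.
  while IFS= read -r name; do
    # Skip blank lines and names that do not exist (or were already moved).
    [ -f "$name" ] || continue
    mv -- "$name" "$dest/"
  done < "$list"
}
```

With that defined, move_listed files.txt audio does the same job as the mkdir and for loop above.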
Great! We can now go through these files at our leisure! If you’re on a Mac, you can open that directory in a Finder window by typing open . and get started going through those files. (If you’re on Linux, I dunno what to tell you.)
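For a future run, ffprobe (which normally ships with ffmpeg) can answer the "does this file have audio?" question directly, without parsing log text. This is only a sketch: has_audio and list_audio_files are made-up names, and it assumes ffprobe is on your PATH:

```shell
# has_audio is a hypothetical helper. It succeeds if ffprobe reports at
# least one audio stream in the given file.
has_audio() {
  local streams
  # -select_streams a limits the report to audio streams, and
  # -of csv=p=0 prints one bare value per stream.
  streams=$(ffprobe -v error -select_streams a \
    -show_entries stream=codec_type -of csv=p=0 "$1" 2>/dev/null)
  [ -n "$streams" ]
}

# list_audio_files (also hypothetical) prints the name of each argument
# for which has_audio succeeds.
list_audio_files() {
  local file
  for file in "$@"; do
    if has_audio "$file"; then
      printf '%s\n' "$file"
    fi
  done
}
```

Run from the mp4 directory, list_audio_files *.mp4 > files.txt would produce the same kind of list as the ffmpeg and grep steps, in one pass.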
There are so many more neat and interesting things you can do with the command line, and this post just barely scratches the surface. I would encourage anyone reading this post to keep digging and to check out utilities like sed, awk, tr, rsync, and especially dd to see what other exciting things you can do on the CLI.
I hope this post was helpful. Feel free to leave a comment below or find me on Twitter: https://twitter.com/dmuth
Enjoy!