I work as a Developer at the BFI National Archive, where a recent project saw the Data and Digital Preservation Department add technical AV file metadata to fields in our Collections Information Database (CID). Colleagues can now search very specific technical information across our digitised collections. The Python for this project uses MediaArea’s MediaInfo for audiovisual file metadata extraction, and ExifTool for documents. It extracts a media file’s metadata and stores it in a text file while the media file completes an automated ingest into our long-term preservation infrastructure and onto magnetic data tape, before being deleted. A little later the metadata text file is opened, read and transformed into XML string data, which Python posts to the file’s new CID digital media record using our custom RestAPI library.

Ever wish you could magic the metadata from a folder full of AV files into a nice CSV file? Well, this blog will hopefully get you to that point, using MediaInfo software to extract metadata from your media files and cutting it up in Python before writing it to CSV. We’re going to use the Python standard library for this, though I have links at the end to some promising-looking PyPI metadata extraction projects! On my GitHub repository Metadata-extraction, I’ve added some MKV files, JSON metadata examples and two Python scripts you can use to help build your CSVs…
Metadata is just data about data…
For those who are unsure what technical metadata is, it is information that describes your digital file but not the content of the file. For example, the metadata for the video above might look like the H.264 MP4 extract below. It tells you very little about the video imagery, and nothing about its trams and mobile phones.
General
Complete name : metadata_test.mp4
Format : MPEG-4
Format profile : Base Media / Version 2
Codec ID : mp42 (mp42/mp41)
File size : 263 MiB
Duration : 1 min 48 s
Overall bit rate : 20.3 Mb/s
TIM : 00:00:00:00
TSC : 25
TSZ : 1
Video
ID : 1
Format : AVC
Format/Info : Advanced Video Codec
Format profile : Main@L4.1
Format settings : CABAC / 4 Ref Frames
Format settings, CABAC : Yes
Format settings, Reference frames : 4 frames
Format settings, GOP : M=4, N=25
Codec ID : avc1
Codec ID/Info : Advanced Video Coding
Duration : 1 min 48 s
Bit rate : 20.0 Mb/s
Width : 1 920 pixels
Height : 1 080 pixels
Display aspect ratio : 16:9
Frame rate mode : Constant
Frame rate : 25.000 FPS
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Progressive
Bits/(Pixel*Frame) : 0.386
Stream size : 258 MiB (98%)
Language : English
Color range : Limited
Color primaries : BT.709
Transfer characteristics : BT.709
Matrix coefficients : BT.709
Audio
ID : 2
Format : AAC LC
Format/Info : Advanced Audio Codec
Codec ID : mp4a-40-2
Duration : 1 min 48 s
Source duration : 1 min 48 s
Bit rate mode : Constant
Bit rate : 317 kb/s
Channel(s) : 2 channels
Channel layout : L R
Sampling rate : 48.0 kHz
Compression mode : Lossy
Stream size : 4.10 MiB (2%)
Source stream size : 4.10 MiB (2%)
Language : English
It does tell you all about the digital file format, video and audio codecs, duration, bit rates, width and height, colour space… all really necessary information for archives that preserve these assets long into the future. Making metadata visible helps us ensure a format remains accessible over time, as formats and codecs change and become obsolete.
The example above has been made using MediaArea’s MediaInfo software, using the text layout, which is nice and easy for humans to read. But for Python, this kind of data is much easier to work with when you export it formatted as JSON. You can see what that looks like by piping your MediaInfo metadata output to Python’s json.tool from a command line interface where both are installed:
mediainfo --output=JSON metadata_test.mp4 | python3 -m json.tool
{
"media": {
"@ref": "metadata_test.mp4",
"track": [
{
"@type": "General",
"VideoCount": "1",
"AudioCount": "1",
"FileExtension": "mp4",
"Format": "MPEG-4",
"Format_Profile": "Base Media",
"CodecID": "mp42",
"CodecID_Compatible": "mp42/mp41",
"FileSize": "275406280",
"Duration": "108.440",
"OverallBitRate": "20317689",
"FrameRate": "25.000",
"FrameCount": "2711",
"StreamSize": "81389",
"HeaderSize": "79680",
"DataSize": "275326600",
"FooterSize": "0",
"IsStreamable": "Yes",
"extra": {
"TIM": "00:00:00:00",
"TSC": "25",
"TSZ": "1"
}
},
{
"@type": "Video",
"StreamOrder": "0",
"ID": "1",
"Format": "AVC",
"Format_Profile": "Main",
"Format_Level": "4.1",
"Format_Settings_CABAC": "Yes",
"Format_Settings_RefFrames": "4",
"Format_Settings_GOP": "M=4, N=25",
"CodecID": "avc1",
"Duration": "108.440",
"BitRate": "19994368",
"Width": "1920",
"Height": "1080",
"Stored_Height": "1088",
"Sampled_Width": "1920",
"Sampled_Height": "1080",
"PixelAspectRatio": "1.000",
"DisplayAspectRatio": "1.778",
"Rotation": "0.000",
"FrameRate_Mode": "CFR",
"FrameRate": "25.000",
"FrameCount": "2711",
"ColorSpace": "YUV",
"ChromaSubsampling": "4:2:0",
"BitDepth": "8",
"ScanType": "Progressive",
"StreamSize": "271023659",
"Language": "en",
"colour_description_present": "Yes",
"colour_description_present_Source": "Stream",
"colour_range": "Limited",
"colour_range_Source": "Stream",
"colour_primaries": "BT.709",
"colour_primaries_Source": "Stream",
"transfer_characteristics": "BT.709",
"transfer_characteristics_Source": "Stream",
"matrix_coefficients": "BT.709",
"matrix_coefficients_Source": "Stream"
},
{
"@type": "Audio",
"StreamOrder": "1",
"ID": "2",
"Format": "AAC",
"Format_AdditionalFeatures": "LC",
"CodecID": "mp4a-40-2",
"Duration": "108.440",
"Source_Duration": "108.480",
"BitRate_Mode": "CBR",
"BitRate": "317324",
"Channels": "2",
"ChannelPositions": "Front: L R",
"ChannelLayout": "L R",
"SamplesPerFrame": "1024",
"SamplingRate": "48000",
"SamplingCount": "5205120",
"FrameCount": "5083",
"Compression_Mode": "Lossy",
"StreamSize": "4301232",
"Language": "en"
}
]
}
}
Capturing metadata in Python
The subprocess module is great if you want to capture metadata from a piece of software using the command line. You don’t need to install subprocess as it comes as part of the Python standard library.
Python’s subprocess spawns a new process that runs your command and captures the results in your code. It helps if you have used software from a command prompt like Terminal before! Note: you must have the command line version of MediaInfo installed on your computer to use the subprocess call below from a Python shell, see install details here.
Let’s look at an example subprocess call to capture the output for an audiovisual file. This saves the result into an ‘mdata_bytes’ variable, with the metadata itself in its stdout attribute – you can use any characters to make a variable name of your choice. The output is returned as bytes, but it converts to a Python dictionary object with a json.loads() call, after importing the Python standard library module json. The lines starting with # are comments for extra context!
# Import the Python modules you need to work with
import subprocess
import json

# Build your command, a list of strings (note the comma after every item)
command = [
    "mediainfo", "-f",
    "--output=JSON",
    "<path to file here>"
]

# Your subprocess call, with arguments to capture the output
mdata_bytes = subprocess.run(
    command,
    shell=False,
    capture_output=True
)

# Load the bytes in mdata_bytes.stdout into a dictionary using json
metadata = json.loads(mdata_bytes.stdout)
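Before parsing, it can be worth checking the call actually succeeded. This is a small optional addition of my own, not part of the original example: subprocess.run gives you the return code and the captured stderr for free, so you could wrap that final step like this:
# Optional sanity check (my own addition): only parse if MediaInfo succeeded
if mdata_bytes.returncode == 0 and mdata_bytes.stdout:
    metadata = json.loads(mdata_bytes.stdout)
    # The first track in MediaInfo's JSON output is usually the 'General' one
    print(metadata["media"]["track"][0]["Format"])
else:
    print("MediaInfo reported a problem:", mdata_bytes.stderr.decode())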
If you want to try out the code above then you need access to a Python shell (or REPL, Read-Eval-Print Loop – great guidance here), or a code editor like VS Code or PyCharm. You will also need the MediaInfo CLI installed on your computer.
But if you just want to play with some metadata (free JSON examples in my repository), then you can try out the code below by launching a web-hosted Python interpreter, like ezpy.io, which lets you make files using your metadata examples. You can then open, read, edit and create a CSV export! To get to the same point as our shell users above, run this code in your web interpreter:
# Create a variable and add a multi-line string of JSON metadata
mdata = '''
{
"creatingLibrary":{"name":"MediaInfoLib","version":"25.07.1","url":"https://mediaarea.net/MediaInfo"},
"media":{"@ref":"MKV_sample.mkv","track":[{"@type":"General","UniqueID":"272866968838251533648107611816904232186",
"VideoCount":"1",
"AudioCount":"2",
"FileExtension":"mkv",
"Format":"Matroska",
"Format_Version":"4",
"FileSize":"8149026",
"Duration":"10.000",
"OverallBitRate_Mode":"VBR",
"OverallBitRate":"6519221",
"FrameRate":"25.000",
"FrameCount":"250",
"StreamSize":"168789",
"IsStreamable":"Yes",
"Encoded_Application":"Lavf58.76.100",
"Encoded_Library":"Lavf58.76.100",
"extra":{"ErrorDetectionType":"Per level 1"}},{"@type":"Video","StreamOrder":"2",
"ID":"3",
"UniqueID":"9899671846087928484",
"Format":"FFV1",
"Format_Version":"3.4",
"Format_Settings_GOP":"N=1",
"Format_Settings_SliceCount":"24",
"CodecID":"V_MS/VFW/FOURCC / FFV1",
"Duration":"10.000000000",
"BitRate_Mode":"VBR",
"BitRate":"1781489",
"Width":"720",
"Height":"576",
"PixelAspectRatio":"1.455",
"DisplayAspectRatio":"1.818",
"FrameRate_Mode":"CFR",
"FrameRate":"25.000",
"FrameRate_Num":"25",
"FrameRate_Den":"1",
"FrameCount":"250",
"Standard":"PAL",
"ColorSpace":"YUV",
"ChromaSubsampling":"4:2:2",
"BitDepth":"10",
"ScanType":"Interlaced",
"ScanOrder":"TFF",
"Compression_Mode":"Lossless",
"Delay":"0.000",
"Delay_Source":"Container",
"TimeCode_FirstFrame":"00:00:09:23",
"TimeCode_Source":"Matroska tags",
"StreamSize":"2226861",
"Encoded_Library":"Lavc58.54.100 ffv1",
"Language":"en",
"Default":"Yes",
"Forced":"No",
"colour_description_present":"Yes",
"colour_description_present_Source":"Container",
"colour_range":"Limited",
"colour_range_Source":"Container",
"colour_primaries":"BT.601 PAL",
"colour_primaries_Source":"Container",
"transfer_characteristics":"BT.709",
"transfer_characteristics_Source":"Container",
"matrix_coefficients":"BT.470 System B/G",
"matrix_coefficients_Source":"Container",
"extra":{"coder_type":"Range Coder","MaxSlicesCount":"24","ErrorDetectionType":"Per slice"}},{"@type":"Audio","@typeorder":"1","StreamOrder":"0",
"ID":"1",
"UniqueID":"10335038327216613133",
"Format":"PCM",
"Format_Settings_Endianness":"Little",
"Format_Settings_Sign":"Signed",
"CodecID":"A_PCM/INT/LIT",
"Duration":"9.978000000",
"BitRate_Mode":"CBR",
"BitRate":"2304000",
"Channels":"2",
"SamplingRate":"48000",
"SamplingCount":"478944",
"BitDepth":"24",
"Delay":"0.021",
"Delay_Source":"Container",
"Video_Delay":"0.021",
"TimeCode_FirstFrame":"00:00:09:23",
"TimeCode_Source":"Matroska tags",
"StreamSize":"2873664",
"Language":"en",
"Default":"Yes",
"Forced":"No"},{"@type":"Audio","@typeorder":"2","StreamOrder":"1",
"ID":"2",
"UniqueID":"14449682800989093866",
"Format":"PCM",
"Format_Settings_Endianness":"Little",
"Format_Settings_Sign":"Signed",
"CodecID":"A_PCM/INT/LIT",
"Duration":"9.999000000",
"BitRate_Mode":"CBR",
"BitRate":"2304000",
"Channels":"2",
"SamplingRate":"48000",
"SamplingCount":"479952",
"BitDepth":"24",
"Delay":"0.000",
"Delay_Source":"Container",
"Video_Delay":"0.000",
"TimeCode_FirstFrame":"00:00:09:23",
"TimeCode_Source":"Matroska tags",
"StreamSize":"2879712",
"Language":"en",
"Default":"No",
"Forced":"No"}]}
}
'''
# Create and write that mdata variable into a new file
with open("metadata1.txt", "w+") as data:
    data.write(mdata)
Substitute your own JSON-exported AV metadata between the ''' markers (MediaInfoOnline lets you export metadata), or copy and paste some from the repository examples here. The code snippet above will create your metadata1.txt file within the ezpy.io client folder system, a very simple folder structure of /home/pyodide and /home/web_user. Now we can open and import it again! It may seem excessive to create a file, but we’re going to want a couple of files saved into your ezpy.io web folders later on to make our CSV from! Now clear your ezpy.io window and let’s look at your file!
# Import the Python os library and print os list directory
import os
print(os.listdir())
If your file is there, then let’s open it using a very similar command to the one used to create it. We’ll also import the Python json library so that we can convert the string of text into a Python dictionary object.
# Import the Python json library
import json

# Open the file and read its contents into a string
with open("metadata1.txt", "r") as data:
    mdata = data.read()

# Convert the string from the text file into a dictionary
metadata = json.loads(mdata)
You now have lovely dictionary metadata you can work with! Carry on with the examples below and try to access as many metadata fields as you can.
The MediaInfo JSON dictionary has a “media” key, which contains the track listings. To access the tracks and all their data, you first need to navigate your way to them by calling the dictionary variable (i.e. metadata) and using the .get() method with the name of the key you want the value for. The tracks are kept in a list within “media”/“track”. Each track within the list is a dictionary of key-value pairs containing the metadata. You can run a ‘for loop’ (examples below) that moves through the tracks one at a time – and match the track “@type” key to the MediaInfo track names:
– General
– Video
– Image
– Audio
– Text
– Other
# Extract the track data in a list from the dictionary
tracks = metadata.get("media").get("track")

# Iterate the track list and print the 'General' track
for track in tracks:
    if track["@type"] == "General":
        print(track)

# Print out the Format value of the 'Video' track
for track in tracks:
    if track["@type"] == "Video":
        print(track.get("Format"))

# Print the duration from each track, using f-strings to differentiate
for track in tracks:
    if track.get("@type") == "General":
        print(f"General duration: {track.get('Duration')}")
    elif track.get("@type") == "Video":
        print(f"Video duration: {track.get('Duration')}")
    elif track.get("@type") == "Audio":
        print(f"Audio duration: {track.get('Duration')}")
This way you can export any metadata from your file, but watch out for fields like FFV1’s MaxSlicesCount, which when exported as JSON sits within an additional dictionary layer called ‘extra’. To access it you will need to make an extra dictionary .get():
for track in tracks:
    if track["@type"] == "Video":
        d = track.get("extra").get("MaxSlicesCount")
        print(d)
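Not every file will have an ‘extra’ block, so if you’re looping over a mixed folder it’s worth guarding against that. This is my own suggested variation rather than anything from the scripts: passing an empty dictionary as the default to the first .get() means the second .get() never runs on None.
for track in tracks:
    if track.get("@type") == "Video":
        # Default of {} avoids an AttributeError when "extra" is missing
        max_slices = track.get("extra", {}).get("MaxSlicesCount")
        print(max_slices)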
Export your metadata to CSV
I had a bit of fun recently making some scripts for two different use cases. They are both metadata extraction scripts that aim to populate a CSV with the metadata found. You can find both of the scripts in my GitHub repository, links below. I’ve tried to describe what the code does in the functions of each script, but if you want any more information please just drop me a message and I’ll be happy to chat more! Caveat: these scripts won’t retrieve multiple video or audio stream data into your CSV file, just the first. I hope they are still useful aids to get you started!
get_metadata.py
This piece of code has been written for those running Python and MediaInfo CLI on their laptops. It’s a standalone Python script that should let you run the code without the need for a Python environment setup or any tricky installations. You just need to copy the script into your local file system, call it with python3 and supply an absolute path to a folder full of AV files.
python3 get_metadata.py "/mnt/path_to_project/av_files"
The code will then read through all of your files, generating JSON metadata. At the top of the script is a list of dictionary entries, the keys of which form the CSV column headers (these are BFI database field names for our metadata project) and the values of which match the metadata in the MediaInfo JSON output. For every file’s metadata extracted and matched to this mapping, a new row is added to the CSV until all files in your folder are listed. So if you have a folder for a new project with 100 files in it, you should get a CSV with 101 lines, starting with your column header titles.
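To make that idea concrete, here is a minimal sketch of the approach rather than the actual get_metadata.py code – the header names and folder path are my own illustrative inventions – using the csv module’s DictWriter to map MediaInfo JSON values to columns:
import csv
import json
import os
import subprocess

folder = "<path to your folder of AV files>"
# Illustrative column headers only, not the BFI database field names
headers = ["filename", "format", "duration", "width", "height"]

with open("metadata.csv", "w", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=headers)
    writer.writeheader()
    for fname in os.listdir(folder):
        fpath = os.path.join(folder, fname)
        result = subprocess.run(
            ["mediainfo", "-f", "--output=JSON", fpath],
            shell=False, capture_output=True
        )
        data = json.loads(result.stdout)
        tracks = (data.get("media") or {}).get("track") or []
        general = next((t for t in tracks if t.get("@type") == "General"), {})
        video = next((t for t in tracks if t.get("@type") == "Video"), {})
        # One CSV row per file, pulled from the General and Video tracks
        writer.writerow({
            "filename": fname,
            "format": general.get("Format"),
            "duration": general.get("Duration"),
            "width": video.get("Width"),
            "height": video.get("Height"),
        })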

Here’s an example output for some screencast videos I recently recorded on my Linux laptop for a presentation! The metadata gets neatly placed into its header column for every file, with the filename in the first column to help you understand what you’re seeing.

The script also prints out confirmation of the file it’s adding as a CSV row as it works through! Only tested on Mac so far, so all feedback welcome! Hope it’s helpful!
Note: if you add a trailing slash to your supplied folder path when you call the script, the CSV will be dropped inside the folder when the process has completed. If you leave the slash off, it will be placed alongside your folder.
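If you’re curious why, my best guess – and this is an assumption about the script’s internals, not something I’ve checked – is that the output folder comes from something like os.path.split, where a trailing slash changes what counts as the final path component:
import os

# With a trailing slash the "parent" is the folder itself...
print(os.path.split("/mnt/path_to_project/av_files/")[0])  # /mnt/path_to_project/av_files
# ...without it, the parent is one level up
print(os.path.split("/mnt/path_to_project/av_files")[0])   # /mnt/path_to_project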
ezpy_metadata_to_csv.py
This script is for learners using ezpy.io to create metadata text files from our JSON examples. Once you have a couple of files created, double-check they are in the same current working folder as your interpreter, and see if you find something along the lines of this:
import os
print(os.listdir())
["metadata1.txt", "metadata2.txt", "main.py"]
Don’t worry about ‘main.py’, that is where your ezpy code is stored for each run you make! We are only interested in the creation of your metadata text files. If you have none because you reset your browser then head back up to the earlier code and recreate some metadata!
Now you can clear your ezpy.io window using the green loop button, copy the text from ezpy_metadata_to_csv.py and paste it into your left window, then run it! Hopefully you get something like this:

To be absolutely sure you have a CSV file made, run your print(os.listdir()) command one more time, and you should have a “metadata.csv” file listed alongside your metadata text files.
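If you want to go one step further and peek inside it (a small extra of my own, assuming the output file is named metadata.csv), the csv module can read it straight back:
import csv

# Print every row of the freshly made CSV, header row first
with open("metadata.csv", "r", newline="") as csv_file:
    for row in csv.reader(csv_file):
        print(row)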
Congratulations! You ran a metadata extraction script on a web interpreter! When you get Python and MediaInfo installed on a laptop/computer then give get_metadata.py a go!
Happy metadata extraction using Python!
Useful links:
- The BFI open-source scripts are available on the bfidatadigipres GitHub. Please take a look!
- PyPI.org hosts Python projects that you can install and use in your Python code. One of these, PyMediaInfo, looks like a great tool for accessing metadata in Python – see the short sketch after this list. You still need MediaInfo installed, but it will cut out some of the steps above if you use it!
- There’s also a similar tool called PyExifTool, which lets you pull ExifTool metadata directly into your code too. Like MediaInfo, you need ExifTool installed before you can use it!
- I’m aware this blog is a little light on Python code context, so I recommend you check out the Women Who Code Python Study Group to help you learn Python! This video playlist introduces you to basic and more complex Python code – these women are so impressive and prolific open educators!
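For completeness, here’s roughly what the PyMediaInfo route looks like – a minimal sketch based on its documented MediaInfo.parse() entry point, not code from the BFI scripts:
from pymediainfo import MediaInfo

# Parse a file and loop through its tracks, much like the dictionary examples above
media_info = MediaInfo.parse("<path to file here>")
for track in media_info.tracks:
    if track.track_type == "Video":
        print(track.format, track.width, track.height)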